PLOS Biology. 2023 Dec 5;21(12):e3002397. doi: 10.1371/journal.pbio.3002397

Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants

Sourabh Palande 1,#, Joshua A M Kaste 2,3,#, Miles D Roberts 3,#, Kenia Segura Abá 3,#, Carly Claucherty 4, Jamell Dacon 5, Rei Doko 5, Thilani B Jayakody 4, Hannah R Jeffery 4, Nathan Kelly 6, Andriana Manousidaki 7, Hannah M Parks 2, Emily M Roggenkamp 4, Ally M Schumacher 3, Jiaxin Yang 1, Sarah Percival 2, Jeremy Pardo 3, Aman Y Husbands 8, Arjun Krishnan 1,2, Beronda L Montgomery 2,9,10, Elizabeth Munch 1,11, Addie M Thompson 4,12, Alejandra Rougon-Cardoso 13,14, Daniel H Chitwood 1,6,*, Robert VanBuren 6,12,*
Editor: Hajk-Georg Drost 15
PMCID: PMC10723737  PMID: 38051702

Abstract

Since they emerged approximately 125 million years ago, flowering plants have evolved to dominate the terrestrial landscape and survive in the most inhospitable environments on earth. At their core, these adaptations have been shaped by changes in numerous, interconnected pathways and genes that collectively give rise to emergent biological phenomena. Linking gene expression to morphological outcomes remains a grand challenge in biology, and new approaches are needed to begin to address this gap. Here, we implemented topological data analysis (TDA) to summarize the high dimensionality and noisiness of gene expression data using lens functions that delineate plant tissue and stress responses. Using this framework, we created a topological representation of the shape of gene expression across plant evolution, development, and environment for the phylogenetically diverse flowering plants. The TDA-based Mapper graphs form a well-defined gradient of tissues from leaves to seeds, or from healthy to stressed samples, depending on the lens function. This suggests that there are distinct and conserved expression patterns across angiosperms that delineate different tissue types or responses to biotic and abiotic stresses. Genes that correlate with the tissue lens function are enriched in central processes such as photosynthetic, growth and development, housekeeping, or stress responses. Together, our results highlight the power of TDA for analyzing complex biological data and reveal a core expression backbone that defines plant form and function.


Topological Data Analysis shows how gene expression shapes the form and function of flowering plants, which have adapted to dominate Earth’s varied landscapes over 125 million years. The findings reveal consistent gene patterns across these plants, illuminating their responses to stress and providing a novel way to explore complex biological data.

Introduction

Over 300,000 gene expression datasets have been collected for thousands of diverse plant species spanning over 900 million years of divergence [1]. This wealth of publicly available datasets spans ecological niches, species, developmental stages, tissues, stresses, and even single cells, providing a largely untapped reservoir of biological information. These diverse datasets provide an opportunity to link insights from various biological disciplines, including ecology, development, physiology, genetics, evolution, biochemistry, and cell biology through a common computational and mathematical framework. These gene expression datasets have been analyzed individually for specific experiments and hypotheses, but large-scale meta-analyses across the publicly available expression datasets are largely nonexistent for plants.

Beyond a common currency that links the subdisciplines of biology, gene expression links its emergent levels. Below gene expression, the genome gives rise to transcriptional networks and protein interactions that are directly responsible for the complexity of gene expression. Above it, gene expression orchestrates cell-specific expression and the development of the organism itself, impacting phenotypes ranging from physiology to plasticity that propagate further to the population, community, and ecological levels. These features, from molecular (DNA, promoter sequences, -omics datasets) to the organismal, population, and ecological levels (life history traits, climatic data from species distributions, etc.) have been used in the past as labels and predicted outputs of machine learning models [2,3]. The structure—the shape—of gene expression in flowering plants is therefore a constraint that is formed by and impacts biological phenomena below and above it, respectively.

Data visualization lies at the heart of exploratory data analysis and provides us with a powerful tool for generating hypotheses that can later be examined using standard statistical techniques. In the era of Big Data, the development of new data visualization pipelines has become increasingly important due to the high dimensionality of the datasets generated and the need to identify patterns and structures that can then become targets for more focused studies. Just as we can look upon the shape of a leaf and derive insights into how it functions from multiple perspectives (developmental, physiological, and evolutionary), we can visualize the shape of any type of data using a Mapper graph [4]. The Mapper algorithm takes as input a filter function that describes a biological aspect of the data and uses mathematical ideas of shape to return a graph that reveals the underlying structure of the data. Even abstract data types like gene expression datasets, therefore, have a shape that we can visualize and derive insights from. For example, Nicolau and colleagues visualized the structure of breast cancer gene expression, identifying 2 distinct branches with differing underlying genotypes and prognostic outcomes that traditional statistical and bioinformatic approaches fail to resolve [5]. This structure was revealed using a pairwise correlation distance matrix as input and modeling of the residuals of each sample from a vector of healthy gene expression as a measure of disease severity. In a second example, using a lens of developmental stage on single-cell RNASeq data, Rizvi and colleagues visualized the underlying structure of gene expression during murine embryonic stem cell differentiation, revealing transient states as well as asynchronous and continuous transitions between cell types [6]. In both examples, Mapper allowed the shape of data, through a selected lens, to be visualized. 
The resulting topology of the graph—in the form of loops, branch points, or flares—allowed previously hidden structures to be seen and novel insights to be derived. Loops, branch points, and flares in topological data analysis (TDA)-based Mapper graphs are visual representations of patterns, transitions, and outliers in the data. They provide insights into the topological structure and organization of the data, helping to identify clusters, subgroups, and potential anomalies. Loops represent recurring patterns or relationships in the data, branch points occur when different subsets of data points exhibit distinct topological characteristics, and flares typically indicate outliers or subgroups within a larger cluster and can help identify regions of interest or anomalous behavior in the data.

Surveys of gene expression capture tens of thousands of data points per sample, and this high dimensionality can be represented by a unique shape that underlies emergent biological features. This shape explains gene expression along evolutionary, developmental, and environmental trajectories, leading to innovations that have marked the successful adaptation and proliferation of plant species. To visualize this shape is to better understand what transcriptional profiles are possible and to know the boundaries or constraints that permit or limit gene expression. Here, we analyzed publicly available gene expression profiles across diverse flowering plant families and visualized the underlying structure of gene expression in plants as a graph using the Mapper algorithm. We identified unique topological shapes of plant gene expression when viewed through lenses that delineate different tissue or stress responses. These complex, emergent patterns were largely hidden by biological complexity and sample heterogeneity. Our results demonstrate the ability of Mapper to uncover these patterns in high-dimensional plant gene expression datasets and its potential as a powerful tool for biological hypothesis generation.

Results

A representative catalog of flowering plant gene expression

The vast number of gene expression datasets in plants provides a unique opportunity to search for patterns of conservation and divergence throughout angiosperm evolution, across developmental time, tissues, and stress response axes. Previous studies have tried to find common signatures that define different plant tissues or responses to abiotic/biotic stresses, but these have been limited in species breadth [7], depth [8], or had limited downstream analyses [9]. Here, we reanalyzed public expression data from the NCBI Sequence Read Archive (SRA) and applied a topological data analysis method to map the shape of gene expression in plants. We included 54 species that captured the broadest phylogenetic diversity within angiosperms while maximizing the breadth of expression at the tissue and stress levels (Fig 1A). This includes 44 eudicots across 13 families and 9 monocot species across 2 families, as well as Amborella trichopoda, which is sister to the rest of angiosperms. Raw reads were downloaded, cleaned, and reprocessed through a common RNAseq pipeline to remove artifacts related to the different algorithms and downstream analyses used by each group. After filtering datasets with low read mapping, our final set of expression data includes 2,671 samples across 7 distinct developmental tissues and 9 stress classifications for 54 species.

Fig 1. Dimensional space of plant gene expression across evolution, development, and stress.

(A) Representative phylogeny of the 54 plant species included in this study. Nodes (species) are colored by plant family as denoted in Fig 1C. Dimensionality reduction of all samples by principal components (left) and t-SNE (right) are shown for tissue type (B), plant family (C), and abiotic/biotic stress (D). Individual samples are quantified and colored by tissue, family, and stress as shown in the respective bar plots. (E) Hierarchical clustering of samples with various biological features highlighted (stress, family, and tissue). Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code to regenerate analyses can be found in https://zenodo.org/records/8428609 [65].

To facilitate comparisons of gene expression across species, we limited our analysis to a set of 6,328 orthologous low-copy genes that were conserved across all 54 plant species using Orthofinder [10]. These sets of orthologous genes or orthogroups are mostly single copy in our diploid species and scale with ploidy in polyploid species. The orthogroups are conserved across a diverse selection of Angiosperm lineages and correspond to well-conserved biological processes. Gene ontology (GO) term enrichment analysis on the Arabidopsis thaliana loci associated with these orthogroups show enrichment for basic biological functions like “DNA replication initiation” and “tRNA methylation” at the top of the list of enriched GO terms, as well as functions specific to photosynthetic organisms like “photosystem II assembly,” and “tetraterpenoid metabolic process.” Although the remaining orthogroups contain significant biological information, they were excluded from analysis as multigene families typically have diverse functions with divergent expression profiles that would conflate downstream comparative analyses.

The transcript per million (TPM) counts were summed for all genes within an orthogroup for a given species and merged into a single dataframe to create a final matrix of 6,328 orthologs by 2,671 samples. Principal component analysis (PCA) [11] and t-distributed stochastic neighbor embedding (t-SNE) [12] based dimensionality reduction show some separation of samples by different biological factors (Fig 1). The sample space is most clearly delineated by tissue, where both PC1 (explaining 25.4% of the variation) and t-SNE1 separate the samples into a gradient from root to leaf tissues with other plant tissues sandwiched in between (Fig 1B and 1D). This distribution largely correlates with tissue function, as the sink tissues of flowers, seeds, and fruits resolve closer to the root samples along t-SNE1 and PC1. No tissue type is separated fully by either dimensionality reduction approach. Samples from the 16 plant families are distributed throughout the dimensional space, suggesting that family- or species-level traits are not masking emergent features of distinct tissues (Fig 1C). Interestingly, abiotic and biotic stresses are similarly distributed throughout the dimensional space, with no clear grouping of the same stress across species or individual experiments. This could be due to intrinsic differences in how individual species respond to stress or to differences in the way stress experiments are carried out by different research groups. To account for batch effects and the influence of unmodeled factors, we applied surrogate variable analysis (SVA) to generate estimates of surrogate variables and their effects on our expression matrices. We identified 24 surrogate variables within the dataset, but these latent variables were intrinsically linked to the primary factors in our study (e.g., stress, tissue, and family). 
Removing surrogate variables would have masked much of the biology we were attempting to quantify, so we chose not to use these “data cleaning” approaches (see Text A in S1 Text for more details).
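The orthogroup summation described above can be illustrated with a minimal sketch. It assumes per-species TPM tables keyed by gene and sample and a gene-to-orthogroup map; the function names, identifiers, and toy values are hypothetical and are not taken from the study's code.

```python
from collections import defaultdict

def sum_tpm_by_orthogroup(gene_tpm, gene_to_orthogroup):
    """Sum TPM over all genes of one species that belong to the same orthogroup.

    gene_tpm: {gene_id: {sample_id: tpm}}
    gene_to_orthogroup: {gene_id: orthogroup_id}; unmapped genes are skipped.
    Returns {orthogroup_id: {sample_id: summed_tpm}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gene, per_sample in gene_tpm.items():
        orthogroup = gene_to_orthogroup.get(gene)
        if orthogroup is None:
            continue  # gene is outside the conserved low-copy set
        for sample, tpm in per_sample.items():
            totals[orthogroup][sample] += tpm
    return {og: dict(samples) for og, samples in totals.items()}

def merge_species(per_species_tables):
    """Merge per-species tables into one orthogroup-by-sample table."""
    merged = defaultdict(dict)
    for table in per_species_tables:
        for orthogroup, samples in table.items():
            merged[orthogroup].update(samples)
    return dict(merged)

# Toy data: species A carries two copies of OG1 (as a polyploid might),
# species B carries one copy.
species_a = sum_tpm_by_orthogroup(
    {"a1": {"A_leaf": 10.0, "A_root": 2.0}, "a2": {"A_leaf": 5.0, "A_root": 1.0}},
    {"a1": "OG1", "a2": "OG1"},
)
species_b = sum_tpm_by_orthogroup({"b1": {"B_leaf": 7.0}}, {"b1": "OG1"})
matrix = merge_species([species_a, species_b])
print(matrix)
```

Summing within an orthogroup gives every species exactly one value per orthogroup and sample, which is what allows samples from different species to share the rows of a single matrix.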

Topological data analysis and the shape of plant gene expression

Traditional dimensionality reduction and hierarchical clustering provided some degree of separation, but they were unable to delineate samples by stress or to identify expression patterns related to biological function. This may be related to residual heterogeneity, noise, or because of the inherent biological complexity that underlies plant evolution and function. To test these possibilities, we used a topological data analysis approach to map the shape of our data. TDA was implemented using Mapper [13], which provides a compact, multiscale representation of the data that is well suited for visual exploration and analysis. Mapper is particularly well suited for genomics data as these datasets typically have extremely high dimensionality and sparsity [5]. To construct mapper graphs from our gene expression data, we created 2 different lenses of tissue and stress, adopting an approach similar to Nicolau and colleagues’ (Fig 2A–2E). To create the stress lens, we first identified all the healthy samples from the dataset and fit a linear model to them (Fig 2; see Methods). This model serves as the idealized healthy orthogroup expression. We then projected all the samples onto this linear model and obtained the residuals. These residuals measure the deviation of the sample gene expression from the modeled healthy expression, and the lens function is simply the length of the residual vector.
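As a rough illustration of the residual-based lens, the sketch below projects every sample onto the linear span of the reference (healthy) samples using least squares and takes the Euclidean length of what is left over. The function name and toy matrix are hypothetical, and using the raw span of the reference samples is a simplifying assumption; the Methods define the model actually fitted.

```python
import numpy as np

def residual_lens(expression, reference_mask):
    """Lens value per sample: length of the residual after projecting the
    sample onto the linear span of the reference samples.

    expression: array of shape (n_genes, n_samples)
    reference_mask: boolean array of length n_samples marking reference samples
    """
    reference = expression[:, reference_mask]           # genes x n_reference
    # Least-squares coefficients that express each sample as a linear
    # combination of the reference samples.
    coef, *_ = np.linalg.lstsq(reference, expression, rcond=None)
    fitted = reference @ coef                            # projection onto the span
    residuals = expression - fitted
    return np.linalg.norm(residuals, axis=0)             # one value per sample

# Toy data: 3 genes, 3 samples; the first two samples are the reference set.
expr = np.array([[1.0, 1.0, 2.0],
                 [0.0, 1.0, 1.0],
                 [0.0, 0.0, 3.0]])
healthy = np.array([True, True, False])
lens = residual_lens(expr, healthy)
print(lens)  # reference samples score about 0; the third sample scores about 3
```

In this simplified form, reference samples have a residual of zero by construction, and if the reference samples spanned the whole gene space every residual would vanish, so a real analysis needs a lower-dimensional model of the reference state.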

Fig 2. Topology-based Mapper graphs and the shape of gene expression in plants.

Overview of Mapper graph construction and lens functions (A-E). The lens function value of each sample is shown in the principal component (top) and t-SNE (bottom) based dimensional reduction from Fig 1 for the tissue (F) and stress lens (G). Mapper graphs across variable cover intervals and interval number for the tissue (H) and stress (I) lens function. The Mapper graph constructions we chose for further analysis are enclosed within a box. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code to regenerate analyses can be found in https://zenodo.org/records/8428609 [65].

The obvious separation between leaf and root samples in the dimension reduction plots supports a strong photosynthetic versus nonphotosynthetic divide. We used this observation to create a binary tissue lens in the same way as the stress lens. We identified all the photosynthetic samples (i.e., leaf tissue) and created an idealized expression profile by fitting a linear model to these expression profiles (Fig 2). We then projected all the samples onto this linear model and obtained the residuals to establish the lens function by tissue. To define the cover for each lens, we divided the range of the lens function into intervals of uniform length, with the same amount of overlap between adjacent intervals. We experimented with a range of interval lengths and overlap sizes to identify the values that produced relatively stable Mapper graphs. The clustering was performed using DBSCAN, a commonly used clustering algorithm in Mapper [14].
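The cover construction can be sketched as follows. This is a minimal example that assumes the overlap is expressed as a fraction of the interval length; the function names and toy lens values are hypothetical. The clustering step inside each interval (DBSCAN in this study) is left out.

```python
def uniform_cover(lens_min, lens_max, n_intervals, overlap):
    """Split [lens_min, lens_max] into n_intervals equal-length intervals in
    which adjacent intervals share a fraction `overlap` of one interval length."""
    if n_intervals < 1 or not 0 <= overlap < 1:
        raise ValueError("need n_intervals >= 1 and 0 <= overlap < 1")
    # n intervals of length L advancing by L * (1 - overlap) span
    # L * (1 + (n - 1) * (1 - overlap)); solve for L.
    length = (lens_max - lens_min) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    cover = [(lens_min + i * step, lens_min + i * step + length)
             for i in range(n_intervals)]
    # Pin the last upper bound to lens_max to guard against rounding error.
    cover[-1] = (cover[-1][0], lens_max)
    return cover

def assign_to_cover(lens_values, cover):
    """For each interval, list the indices of samples whose lens value lies in it."""
    return [[i for i, v in enumerate(lens_values) if lo <= v <= hi]
            for lo, hi in cover]

cover = uniform_cover(0.0, 10.0, n_intervals=3, overlap=0.5)
bins = assign_to_cover([0.0, 4.0, 6.0, 10.0], cover)
print(cover)  # [(0.0, 5.0), (2.5, 7.5), (5.0, 10.0)]
print(bins)   # [[0, 1], [1, 2], [2, 3]]
```

Samples that fall in the overlap between two intervals appear in both bins, and it is this shared membership that later connects clusters from neighboring intervals.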

Overlaying the tissue lens value of each sample over the PCA and t-SNE dimensional space reveals a clear gradient across PC1 and t-SNE1, with the highest lens function values found in seed, fruit, and flower tissues (Fig 2F). For the stress lens function, samples are distributed across the dimensional space, with no obvious correlation between healthy and stressed lens values, similar to the observation from individual abiotic/biotic stresses (Figs 1D and 2G).

Mapper graphs for the tissue and stress lens functions reflect an emergent and striking topological shape of plant expression (Fig 2H and 2I). Each node in the Mapper graphs corresponds to a bin of similar RNAseq samples with color representing the average lens value of samples within each node. Edges (connections) show common samples between overlapping bins. Changing the cover interval overlap and interval number has marginal effects on the core graph structure but changes the shape and connectivity of sparse nodes on the outskirts of the graphs (Fig 2H and 2I). This central stability highlights the robustness of our input data and significance of the underlying features defining the graph shape [15]. The Mapper graphs for both the tissue and stress lens functions show a backbone structure with numerous embedded nodes and flares that form a well-defined gradient from leaf to seed or healthy to stressed, respectively. This suggests that there are distinct and conserved expression patterns across angiosperms that delineate different tissues or responses to biotic and abiotic stresses.
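How nodes, edges, and node colors relate to sample membership can be sketched in a few lines. The example assumes that each node is already available as a set of sample identifiers; all names and values are hypothetical.

```python
from itertools import combinations

def mapper_edges(nodes):
    """nodes: {node_id: set of sample ids}.
    Returns (node_a, node_b, n_shared) for every pair of nodes that share samples."""
    edges = []
    for a, b in combinations(sorted(nodes), 2):
        shared = nodes[a] & nodes[b]
        if shared:
            edges.append((a, b, len(shared)))
    return edges

def node_colors(nodes, lens):
    """Average lens value of the samples in each node."""
    return {node: sum(lens[s] for s in members) / len(members)
            for node, members in nodes.items()}

# Toy data: n0 and n1 share sample s2; n2 is isolated.
nodes = {"n0": {"s1", "s2"}, "n1": {"s2", "s3"}, "n2": {"s4"}}
lens = {"s1": 1.0, "s2": 2.0, "s3": 4.0, "s4": 8.0}
edges = mapper_edges(nodes)
colors = node_colors(nodes, lens)
print(edges)   # [('n0', 'n1', 1)]
print(colors)  # {'n0': 1.5, 'n1': 3.0, 'n2': 8.0}
```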

Our input dataset is unbalanced, with large discrepancies in the number of input samples for different species, stresses, or tissue types. We tested if biases in the distribution of samples could explain the topological shape we observed. We downsampled the most frequent factor combinations and surveyed the effect it had on the Mapper graph topology. Our study has 3 factors: family, tissue, and stress with 16 families, 8 tissue types, and 10 stresses. In total, 1,280 unique 3-way combinations are possible (family + tissue + stress), but in our dataset, only 195 unique combinations are present and they have a heavily skewed distribution (Fig A in S1 Text). Based on this distribution, we chose a cutoff of 30 and downsampled the 30 most common factor combinations. This significantly reduced the sampling bias for family, tissue, and stress, but it did not eliminate it (Fig B in S1 Text). We then reran the Mapper algorithm using this downsampled dataset. The topology is quite similar, suggesting that biases in sample representation are not the major factor underlying the patterns we observed (Fig C in S1 Text).
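One plausible reading of this downsampling step is to cap the number of samples kept for each family + tissue + stress combination. The sketch below does that; the text does not confirm the exact rule, so the capping rule, function name, and toy records are assumptions.

```python
import random
from collections import defaultdict

def downsample_combinations(samples, cutoff, seed=0):
    """samples: list of (sample_id, family, tissue, stress) tuples.
    Keeps at most `cutoff` randomly chosen samples per 3-way combination."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for sample_id, family, tissue, stress in samples:
        groups[(family, tissue, stress)].append(sample_id)
    kept = []
    for combo in sorted(groups):
        members = groups[combo]
        if len(members) > cutoff:
            members = rng.sample(members, cutoff)
        kept.extend(members)
    return kept

# The number of possible combinations follows from the factor levels.
n_possible = 16 * 8 * 10   # families x tissue types x stresses = 1,280

# Toy data: one overrepresented combination and one rare combination.
records = [(f"p{i}", "Poaceae", "leaf", "healthy") for i in range(5)]
records += [("r1", "Rosaceae", "fruit", "cold"), ("r2", "Rosaceae", "fruit", "cold")]
kept = downsample_combinations(records, cutoff=3)
print(n_possible, len(kept))  # 1280 5
```

Rare combinations pass through untouched, so the procedure flattens the most heavily sampled groups without discarding scarce ones.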

Topological shape reflects the underlying biological features of gene expression

To identify and characterize these conserved biological patterns, we first simplified the Mapper graphs into 18 nodes for both the tissue and stress lens functions (Figs 3 and 4). The core tissue-based Mapper graph has discrete nodes for each surveyed plant tissue with a gradual transition of leaves (node 1), to roots (2), fruits (11 and 13), and, finally, seeds (14, 15, and 16; Fig 3A). At the fourth node, the Mapper graph proliferates into terminal branches of flower (node 9), stem (10), fruit (12), and mixtures of uncategorized tissue types (5 and 8). RNAseq samples from the 16 angiosperm families are largely dispersed across nodes by tissue, with some notable exceptions (Fig 3B). Most fruit samples are found along the gradient of the core graph structure, but fruits from the rose (Rosaceae) family form a separate node (node 12). Flowers from the eudicot species are mixed with fruit tissues in nodes along the core graph structure, but monocot flowers from the grass family (Poaceae) are found in discrete, branching nodes (9 and 17). The biotic and abiotic stress RNAseq samples are dispersed by tissue across the Mapper graph (Fig 3C), supporting the complexity and heterogeneity of these samples.

Fig 3. Simplified Mapper graphs detailing the distribution of samples along the tissue lens.

Nodes along the full Mapper graphs (left) are clustered into simplified Mapper graphs (right), and samples are colored by tissue (A), family (B), and stress category (C). Photosynthetic and nonphotosynthetic ends of the Mapper graph are indicated.

Fig 4. Simplified Mapper graphs detailing the distribution of samples along the stress lens.

Nodes along the full Mapper graphs (left) are clustered into simplified Mapper graphs (right) and samples are colored by tissue (A), family (B), and stress category (C). Healthy and stressed ends of the Mapper graph are indicated.

Mapper graphs clearly distinguish tissues across plant taxa, but what are the biological features that underlie this topology? We surveyed the expression patterns of the 6,328 orthogroups used to generate our Mapper graphs to see if they are enriched in certain biological processes related to evolutionarily conserved, tissue-specific functions. We classified genes as positively or negatively correlated with the tissue lens and conducted GO enrichment in these groups of genes. We expect negatively correlated genes to be characteristic of leaf gene expression and positively correlated genes to be characteristic of non-leaf gene expression. Supporting this, Mapper graphs and GO terms associated with the tissue lens–correlated genes point to photosynthetic versus nonphotosynthetic metabolism as a key factor in the overall gene expression patterns of plant tissues (Fig 3 and S1 Dataset). Enriched negatively correlated GO terms are mostly related to photosynthesis and include response to red and blue light, chloroplast and thylakoid organization, carotenoid metabolic process, and regulation of photosynthesis among others (S1 Dataset). Plants and green algae are characterized by a set of well-conserved genes that are not found in nonphotosynthetic organisms termed “the GreenCut2 inventory” [16]. Most of the GreenCut2 genes (421 out of 677) are found within the 6,328 orthogroups in our analysis, and we tested if these are enriched among correlated genes. Genes from the GreenCut2 inventory are overrepresented in this set of genes, with 26.7% of the tissue-correlated (positively or negatively) genes being in the GreenCut2 resource versus 6.7% of the entire set of orthogroups (Table A in S1 Text). This overrepresentation is even more stark if we limit our analysis to only the genes negatively correlated with the tissue lens, of which 50.3% are in the GreenCut2 inventory. 
The overlapping loci between the 2 sets contain genes encoding protein products involved in various aspects of photosynthesis, including pigment biosynthesis and binding (e.g., AT4G10340, AT1G04620, AT1G44446) [17–19], the operation of the photosynthetic light reactions (e.g., AT4G05180, AT5G44650, AT3G17930) [20–22], or the operation of the Calvin–Benson Cycle (AT1G32060) [23].
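The classification of orthogroups by their correlation with a lens, and the background GreenCut2 share quoted above, can be sketched as follows. Pearson correlation and the 0.5 threshold are illustrative assumptions, since the criterion is not stated in this section; the function names and toy values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def classify_by_lens(expression, lens, threshold=0.5):
    """expression: {orthogroup: [value per sample]}; lens: [value per sample].
    The threshold is a hypothetical cutoff, not the study's criterion."""
    labels = {}
    for orthogroup, values in expression.items():
        r = pearson(values, lens)
        if r >= threshold:
            labels[orthogroup] = "positive"
        elif r <= -threshold:
            labels[orthogroup] = "negative"
        else:
            labels[orthogroup] = "uncorrelated"
    return labels

# Toy data: higher tissue lens values correspond to non-leaf samples.
lens = [1.0, 2.0, 3.0, 4.0]
expression = {
    "OG_leaf_like": [4.0, 3.0, 2.0, 1.0],
    "OG_sink_like": [1.0, 2.0, 3.0, 4.0],
    "OG_flat": [1.0, 2.0, 2.0, 1.0],
}
labels = classify_by_lens(expression, lens)

# Background share of GreenCut2 genes among the orthogroups (counts from the text).
background_pct = round(100 * 421 / 6328, 1)
print(labels, background_pct)
```

The background share computed from the counts in the text, 421 of 6,328, reproduces the 6.7% figure against which the 26.7% and 50.3% shares are compared.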

Enriched GO terms that are positively correlated with the tissue lens are largely related to housekeeping and core metabolic processes including ubiquitination, macromolecule catabolism, the electron transport chain, peptide biosynthesis, and Golgi vesicle–mediated transport among many others (S2 Dataset). Enriched genes include proteins involved in the TCA cycle and respiration (e.g., AT1G47420, AT2G18450, AT4G26910) [24–26] and in the development of specific nonphotosynthetic tissue types like seeds (e.g., AT2G40170, AT2G38560) [27,28] and pollen/pollen tubes (e.g., AT2G03120, AT2G41630) [29,30]. However, many of the tissue lens–correlated genes do not intuitively relate to the photosynthetic versus nonphotosynthetic tissue distinction, and further examination of these loci on a gene-by-gene basis may shed light on conserved differences between plant tissues.

The simplified Mapper graph from the stress lens has 18 nodes that form a continuous gradation of healthy to stressed tissues (Fig 4). Individual tissue types, regardless of stress condition, are enriched in certain nodes but are less defined than under the tissue lens (Fig 4A). RNAseq samples related to light and heat stress are found in discrete nodes (1 and 2, respectively) at the terminus of the Mapper graph across all species where these data were available (Fig 4C). Other stress RNAseq samples are found in nodes with healthy tissues but are generally concentrated toward the stress end of the Mapper graph. An interesting exception is a group of cold stressed root samples from the grass (Poaceae) family (node 15). Clustering of distinct stresses within the same node suggests a core stress response conserved across Angiosperms for all abiotic and biotic factors. The gradient of sample distribution from healthy to stressed across the Mapper graph may be related to the severity of stress experienced by plants in each individual experiment.

To explore what constitutes these conserved stress-related expression patterns, we searched for GO enrichment of genes that are positively correlated with the stress lens. This group of genes is heavily enriched in functions related to stress, including responses to water deprivation, chitin, reactive oxygen species, fungi, wounding, bacteria, and general defense mechanisms (S3 Dataset). Genes positively correlated with the stress lens include loci related to the biosynthesis of compounds with diverse stress-related activities like jasmonic acid and jasmonic acid derivatives (AT2G35690, AT2G46370) [31,32] and ascorbic acid (AT3G09940) [33]. Negatively correlated genes are enriched in functions related to growth and reproduction such as DNA replication, mitosis, and rRNA processing, among others (S4 Dataset). This includes genes involved in regulation of the cell cycle (AT3G54650, AT4G12620, AT2G01120) [34–36], chromatin organization (AT1G15660, AT1G65470) [37,38], and the development of reproductive structures (AT1G34350, AT2G41670, AT4G27640, AT3G52940) [39–42]. This pattern points towards an intuitive distinction between the stressed and unstressed samples in our dataset in terms of their investment in cell proliferation and reproduction. Most of these genes are involved in core biological functions with conserved roles across eukaryotes, and their coordinated perturbation could be predictive of stress responses in diverse lineages.

Discussion

Genome-scale datasets have high dimensionality, and even the simplest pairwise experiment has hundreds or thousands of complex and interconnected cellular pathways in dynamic flux between conditions. Comparisons across plant lineages are similarly complex, as each species has its own evolutionary history with thousands of duplicated, lost, or new genes enabling its unique and elegant biology. This complexity presents major challenges for characterizing underlying biological mechanisms and identifying shared and distinct properties across evolutionary timescales. Here, we leveraged the wealth of public gene expression datasets across diverse flowering plants and used a set of deeply conserved genes to search for patterns of conservation across tissue types, stress responses, and evolution. We first tested traditional dimensionality reduction and clustering-based approaches but found that they were largely ineffective and unable to clearly resolve samples. Instead, we used a novel topological framework to compare samples and test for evolutionary conservation.

Topological data analysis has been applied to complex, high dimensionality biological datasets including gene expression profiles correlated with human cancers and other diseases [5,43,44]. To our knowledge, TDA has not been used for plant science datasets outside of shape [45–47]. Flowering plants have tremendous phylogenetic, developmental, phenotypic, and genomic scale diversity, creating additional layers of complexity compared to other lineages. Despite this, Mapper was able to capture hidden and emergent signatures of gene expression at the tissue and stress scales that were missed using traditional approaches. Most developmental tissues or stress responses are not perfectly separated but instead fall within a gradient along a central shape. The central shape of the tissue lens Mapper graph represents the life cycle of a plant with transitions from the vegetative tissues of leaves and roots to reproductive flowers, fruit, and, eventually, seeds. Nodes along the Mapper graphs that contain mixtures of tissues such as fruits and flowers, leaves and stems, or even leaves and roots reflect developmental plasticity, heterogeneity, and overlapping functions between different organs. Flowers give rise to fruits and the complex processes of fertilization, seed, and fruit development blur the lines between distinct tissue types. This complexity and interconnectivity is central to biological processes but is masked by traditional dimensionality reduction approaches, which can oversimplify nonlinear datasets.

The stressed and healthy samples are less clearly delineated in the Mapper graphs than samples from different plant tissues. This may reflect artifacts stemming from variation in the severity, duration, or method of applying stresses across different experiments and species. For example, mildly stressed samples might have expression signatures that mirror healthy tissues with comparatively few differentially expressed genes. Despite this issue, we observed a strong gradient of sample distribution from healthy to stressed across the graph. Distinct stresses were generally found within the same nodes, and genes that were positively correlated with the stress lens show enrichment in classical stress pathways. This includes the core stress-responsive hormones jasmonic acid and abscisic acid and their corresponding transcriptional network as well as broader shifts in metabolic processes geared toward defense. Taken together, this suggests that plants have deeply conserved expression signatures across evolution and for different stresses. Abiotic and biotic stress responses have been mostly studied in isolation, but they typically co-occur in natural environments, and they have overlapping signaling, hormonal, and network responses in plants (reviewed in [48]). The topological shape of gene expression points to a shared set of pathways or perturbations that define if a tissue is healthy or stressed. Environmental stresses broadly disrupt photosynthesis and core metabolic and cellular functions either as a direct response to physical trauma or in preparation for defense or resilience. These changes may serve as the backbone of the topological shape we observed for the stress lens.

Although we observed a deeply conserved pattern of gene expression underlying plant form and function, our analyses capture only a snapshot of the evolutionary innovations found in flowering plants. We used a set of low-copy, conserved genes to enable comparisons of expression across species, and we had to exclude approximately 70% of all plant genes. This includes most enzymes, transcription factors, and regulatory elements, which are mostly found in large, rapidly evolving, or lineage-specific gene families that cannot be resolved to high-confidence orthologs across eudicots and monocots. Duplication and subsequent sub- or neofunctionalization of these genes drive the evolution of new plant traits and developmental differences of plant organs. Single-copy genes by contrast have deeply conserved functions in core metabolism, photosynthesis, and housekeeping processes that typically transcend tissue, species, and environmental changes. Given these limitations, it is somewhat surprising that our analyses were able to clearly separate tissue types and stresses despite missing information from most of the genes that should underlie these biological differences. Applying TDA with a full set of genes in a single species with well-curated gene expression profiles could uncover complex or emergent biological signatures that were previously hidden.

Here, we provide a proof of concept for studying complex biological traits using TDA, and a similar analytical framework could be applied to numerous areas of plant science research and beyond. Compared to the approximately 300,000 published plant gene expression datasets [1], our study has a somewhat sparse sampling of species and a subset of expressed genes, yet we were able to detect a number of hidden trends. TDA of high-resolution sampling over narrower phenotypic spaces such as drought responses in a single species or tissue divergence across 900 million years of plant evolution could yield transformative insights that were previously overlooked. However, researchers should exercise caution when applying TDA to gene expression data as the lack of a robust hyperparameter tuning procedure could potentially result in misleading conclusions. This reflects a broader problem in machine learning and data science, but hyperparameter search, cross-validation, and feature selection can enable data-driven tuning of the appropriate hyperparameters. With the appropriate datasets and sufficient sampling, TDA can be widely applicable for developing a deeper understanding of complex, emergent biological phenomena.

Methods

Assembling a representative catalog of flowering plant expression data

We selected species that captured the broadest phylogenetic diversity within angiosperms and species that had a breadth of expression at the tissue and stress levels. We also selected only species with a high-quality reference genome to enable accurate read mapping and downstream comparative genomics. Metadata including species, accession, tissue type, experimental treatments, replicate number, and sequencing platform were collected manually for each sample using the NCBI BioProject and SRAs, as well as the primary data publications (S6 Dataset). Raw RNAseq reads were downloaded from the NCBI SRA and quantified using a pipeline developed in the VanBuren lab to trim, quantify, and identify differentially expressed genes (https://github.com/pardojer23/RNAseqV2). Using a common analytical pipeline helped reduce noise between experiments that used different algorithms in the original publications. Raw Illumina reads from various platforms were first quality trimmed using fastp (v0.23) [49] with default parameters. The quality filtered reads were pseudoaligned to the corresponding transcripts (gene models) for each species using Salmon (v1.6.0) [50] with the quasi-mapping mode. Transcript-level estimates were converted to gene-level transcript per million counts using the R package tximport [51].

Comparing expression across species

To facilitate detailed cross-species comparisons, we first clustered proteins from all 54 species into orthogroups using OrthoFinder (v2.3.8) [10]. Genomes and proteomes were downloaded for each species from Phytozome v13 [52]. OrthoFinder was run using default parameters; the reciprocal DIAMOND search (v2.0.11) [53] was used for sequence alignment, and groups of similar proteins were clustered using the Markov Cluster Algorithm. In total, 2,317,289 genes (94% of input genes) were clustered into 86,185 orthogroups across the 54 species. Of these, 33,585 orthogroups are found in only a single species and 7,742 are found in at least 52 out of 54 species. This set of broadly conserved orthogroups was further refined by filtering out orthogroups with an average of >2 genes per ortholog for the diploid species to avoid including multigene families with diverse functions in the analysis. This set of 6,335 orthogroups was used as a common framework to allow comparison of expression across species. For orthogroups where a species had more than one gene, the TPMs of all genes in that orthogroup were summed, and the raw TPM was used for single-copy genes. Expression data for each sample across all species were combined into a single expression matrix (S7 Dataset), and SVA was used to characterize the potential impacts of unmodeled technical variables on the dataset (see Text A in S1 Text). PCA was performed using built-in functions in Scikit-learn [54] on the log2(TPM + 1) or z-score transformed gene expression data to reduce dimensionality and capture the main sources of variation within the datasets.
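The aggregation and dimension-reduction steps described above can be sketched as follows. This is a minimal illustration on toy data, not the pipeline used in the study: the gene names, orthogroup labels, and TPM values are hypothetical, and reading "log2+1" as log2(TPM + 1) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical gene-level TPM matrix (rows: genes, columns: samples).
tpm = pd.DataFrame(
    {
        "s1": [10.0, 5.0, 0.0, 3.0],
        "s2": [8.0, 4.0, 1.0, 2.0],
        "s3": [0.0, 1.0, 20.0, 7.0],
        "s4": [1.0, 0.0, 18.0, 9.0],
    },
    index=["geneA1", "geneA2", "geneB1", "geneC1"],
)

# Hypothetical gene-to-orthogroup assignment; OG1 has two genes in this species.
gene_to_og = pd.Series(
    {"geneA1": "OG1", "geneA2": "OG1", "geneB1": "OG2", "geneC1": "OG3"}
)

# Sum TPMs of all genes in the same orthogroup; single-copy genes keep their raw TPM.
og_tpm = tpm.groupby(gene_to_og).sum()

# log2(TPM + 1) transform (assumed interpretation of "log2+1").
log_tpm = np.log2(og_tpm + 1.0)

# PCA expects samples as rows and features (orthogroups) as columns.
scores = PCA(n_components=2).fit_transform(log_tpm.T.to_numpy())
print(og_tpm)
print(scores.shape)
```

In a cross-species setting the same grouping would be applied per species before the per-species matrices are concatenated on the shared orthogroup index.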

Surrogate variable analysis

To account for batch effects and the influence of unmodeled factors on the expression matrix used for the present study, we applied SVA to generate estimates of surrogate variables and their effects on our expression matrices [55,56]. Briefly, SVA assumes that the expression of a particular gene i in independent RNA-seq experiment j can be described by the following linear equation:

$x_{ij} = \mu_i + f_i(y_j) + e_{ij}$ (1)

where $\mu_i$ is the baseline expression level of gene i, $f_i(y_j)$ represents the effect of a measured variable $y_j$, and $e_{ij}$ is the error term [55]. However, if there are L unmodeled factors affecting the expression of gene i, then the error term $e_{ij}$ contains both randomly distributed experimental error and the effects of the unmodeled factors. That is:

$e_{ij} = \sum_{l=1}^{L} \gamma_{li} g_{lj} + e^{*}_{ij}$ (2)

where $g_l = (g_{l1}, \ldots, g_{ln})$ is a function describing the effect of the l-th unmodeled factor (l = 1, …, L) across the n experiments, $\gamma_{li}$ is the coefficient describing the influence of unmodeled factor l on the expression of gene i, and $e^{*}_{ij}$ is the true randomly distributed noise term [55]. Combining (1) and (2) yields:

$x_{ij} = \mu_i + f_i(y_j) + \sum_{l=1}^{L} \gamma_{li} g_{lj} + e^{*}_{ij}$ (3)

By using the svaseq() method implemented in the R package sva (v. 3.36.0) [56,57], we identified and estimated the values of 24 separate surrogate variables. These surrogate variables correspond to vectors of values for each expression value $x_{ij}$ in the $\sum_{l=1}^{L} \gamma_{li} g_{lj} + e^{*}_{ij}$ term in (3).
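To make the model in (3) concrete, the following sketch simulates expression values with one measured variable and one unmodeled factor, then shows that the leading singular vector of the residuals tracks the hidden factor. This is only a toy illustration of the idea behind surrogate variables; it is not the svaseq() algorithm, and the linear form of the measured effect, the dimensions, and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 40

y = np.repeat([0.0, 1.0], n_samples // 2)    # measured variable (e.g. control vs. stress)
g = rng.normal(size=n_samples)               # one unmodeled factor (L = 1)

mu = rng.normal(5.0, 1.0, size=n_genes)      # baseline expression per gene
beta = rng.normal(0.0, 1.0, size=n_genes)    # measured effect, assumed linear: beta_i * y_j
gamma = rng.normal(0.0, 1.0, size=n_genes)   # per-gene loading on the unmodeled factor
noise = rng.normal(0.0, 0.1, size=(n_genes, n_samples))

# Equation (3): baseline + measured effect + unmodeled factor + random noise.
X = mu[:, None] + np.outer(beta, y) + np.outer(gamma, g) + noise

# Remove the modeled part (intercept and y) gene by gene with least squares.
design = np.column_stack([np.ones(n_samples), y])
coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)
resid = X.T - design @ coef                  # samples x genes

# The leading left singular vector of the residuals is a crude surrogate variable.
U, s, Vt = np.linalg.svd(resid, full_matrices=False)
surrogate = U[:, 0]
corr = np.corrcoef(surrogate, g)[0, 1]
print(abs(corr))
```

The sign of a singular vector is arbitrary, so only the absolute correlation with the hidden factor is meaningful here.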

To determine how much of the variation in each surrogate variable is attributable to a proxy batch variable (BioProject), the 3 primary biological variables (stress, tissue, and family), or their pairwise interactions, we regressed each estimated surrogate variable on each variable (batch or biological) or pairwise interaction individually. McNemar’s formula was used to calculate the adjusted R² values for each surrogate variable.

Mathematical basis of topological data analysis

The flexibility of Mapper allows us to apply it to various types of data. Here, we will describe the Mapper construction in the simplest setting of point cloud data and then explain how it was applied to the gene expression data.

Consider a point cloud $X \subset \mathbb{R}^d$ equipped with a function $f: X \to \mathbb{R}$. An open cover of $X$ is a collection $U = \{U_i\}_{i \in I}$ of open sets in $\mathbb{R}^d$ such that $X \subset \bigcup_{i \in I} U_i$, where $I$ is an index set. The 1-dimensional nerve of the cover $U$, denoted $M := N_1(U)$, is called the Mapper graph of $(X, f)$. In this graph, each open set $U_i$ is represented as a vertex $i$, and 2 vertices, $i$ and $j$, are connected by an edge if and only if the intersection of $U_i$ and $U_j$ is nonempty.

To construct a Mapper graph, we start by defining a cover $V = \{V_j\}_{j \in J}$ of the image $f(X) \subset \mathbb{R}$ of $f$, where $J$ is a finite index set, by splitting the range of $f(X)$ into a collection of overlapping intervals. Next, for each $V_j$, we identify the subset of points $X_j$ in $X$ such that $f(X_j) \subset V_j$ and apply a clustering algorithm to identify clusters of points in $X_j$. The cover $U$ of $X$ is the collection of such clusters induced by $f^{-1}(V_j)$ for each $j$. Once we have the cover $U$, we compute its 1-dimensional nerve $M$ and visualize it in the form of a weighted graph.

For example, consider Fig 2A–2E. The point cloud $X$ in this case consists of points in the 2-dimensional plane, in the shape of a “Y”. The function $f$ simply maps each point to its y-coordinate. We divide the range of $f$ into 4 overlapping intervals, represented by the 4 colored segments along the y-axis in Fig 2. For each interval $V_j$, the colored rectangles in the center panel of the figure show the subsets of points $X_j \subset X$ such that $X_j = f^{-1}(V_j)$. Then, we apply clustering to each $X_j$ separately to obtain the cover $U$ of $X$. The 1-dimensional nerve of $U$, i.e., the Mapper graph $M$, is shown in the rightmost panel. The color of each vertex corresponds to the cover interval it belongs to. Fig 2A–2E illustrates Mapper graph construction from the same set of points, but with x-coordinate used as the lens. We can observe that the 2 lens functions produce 2 slightly different Mapper graphs.
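The "Y" example can be reproduced in a few lines. The sketch below is a minimal, self-contained Mapper construction and is not the implementation used in this study: the point spacing, the way each interval is widened to create overlap, and the DBSCAN settings (eps=0.1, min_samples=3) are all illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def make_y_cloud(n=51):
    """Noise-free 'Y': a vertical stem plus two diagonal arms."""
    t = np.linspace(0.0, 1.0, n)
    stem = np.column_stack([np.zeros(n), t])
    left = np.column_stack([-t[1:], 1.0 + t[1:]])
    right = np.column_stack([t[1:], 1.0 + t[1:]])
    return np.vstack([stem, left, right])


def uniform_cover(values, n_intervals, overlap):
    """Equal-length intervals over the lens range; each interval is widened
    by overlap * width / 2 on both sides so that neighbors intersect."""
    lo, hi = values.min(), values.max()
    width = (hi - lo) / n_intervals
    pad = overlap * width / 2.0
    return [(lo + k * width - pad, lo + (k + 1) * width + pad)
            for k in range(n_intervals)]


def mapper_graph(X, lens, n_intervals=4, overlap=0.25, eps=0.1, min_samples=3):
    """Return (nodes, edges): nodes map (interval, cluster) to point indices,
    and an edge joins two nodes whenever they share at least one point."""
    nodes = {}
    for j, (a, b) in enumerate(uniform_cover(lens, n_intervals, overlap)):
        idx = np.flatnonzero((lens >= a) & (lens <= b))   # preimage of interval j
        if idx.size == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in np.unique(labels):
            if lab == -1:                                  # DBSCAN noise label
                continue
            nodes[(j, int(lab))] = set(idx[labels == lab].tolist())
    keys = sorted(nodes)
    edges = [(u, v) for i, u in enumerate(keys) for v in keys[i + 1:]
             if nodes[u] & nodes[v]]
    return nodes, edges


X = make_y_cloud()
nodes, edges = mapper_graph(X, lens=X[:, 1])   # lens: y-coordinate
print(len(nodes), "nodes,", len(edges), "edges")
```

With these settings the three lower intervals each yield a single cluster and the top interval splits into the two arms, so the graph itself has the shape of a "Y". Passing `lens=X[:, 0]` instead uses the x-coordinate as the lens and gives a different graph from the same points.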

Constructing Mapper graphs and lens functions

To construct Mapper graphs from our gene expression data, we create 2 different lenses, adopting an approach similar to the one used by Nicolau and colleagues [5]. We refer to these lenses as the tissue lens and the stress lens, respectively. To create the stress lens, we first identified all the healthy samples from the dataset and fit a linear model to them. This model serves as the idealized healthy orthogroup expression. Then, we project all the samples (healthy as well as stressed) onto this linear model and obtain the residuals. These residuals measure the deviation of the sample gene expression from the modeled healthy expression. The lens function is simply the length of the residual vector. To define the cover, we divide the range of the lens function into intervals of uniform length, with the same amount of overlap between adjacent intervals. We experimented with a range of values for the interval length and the overlap size to identify the values that produced relatively stable Mapper graphs. The clustering was performed using DBSCAN, a commonly used clustering algorithm for Mapper.
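One way to read the stress lens construction is as a projection onto a linear subspace fitted to the healthy samples, with the lens value given by the norm of what is left over. The sketch below follows that reading on simulated data. The rank-k truncation, the simulated expression values, and the function names are assumptions made for illustration; the text above does not specify how the linear model was fit.

```python
import numpy as np


def fit_healthy_basis(healthy, k):
    """Orthonormal basis (genes x k) spanning a rank-k linear model of the
    healthy samples; `healthy` has genes as rows and samples as columns."""
    U, _, _ = np.linalg.svd(healthy, full_matrices=False)
    return U[:, :k]


def residual_lens(samples, basis):
    """Length of each sample's residual after projection onto the healthy model."""
    projection = basis @ (basis.T @ samples)
    return np.linalg.norm(samples - projection, axis=0)


# Simulated data: healthy expression lives near a 3-dimensional subspace,
# and stressed samples carry an additional shared shift away from it.
rng = np.random.default_rng(1)
n_genes, k = 100, 3
W = rng.normal(size=(n_genes, k))
healthy = W @ rng.normal(size=(k, 30)) + 0.05 * rng.normal(size=(n_genes, 30))
shift = rng.normal(size=(n_genes, 1))
stressed = (W @ rng.normal(size=(k, 10)) + shift
            + 0.05 * rng.normal(size=(n_genes, 10)))

basis = fit_healthy_basis(healthy, k)
lens_healthy = residual_lens(healthy, basis)
lens_stressed = residual_lens(stressed, basis)
print(lens_healthy.mean(), lens_stressed.mean())
```

In this toy setting the healthy samples receive small lens values and the stressed samples receive larger ones, which is the ordering a healthy-to-stressed gradient in the Mapper graph relies on.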

The construction of a Mapper graph relies on several user-defined parameters: the lens function f, the cover V, and the clustering algorithm. Optimizing these parameters is an interesting open problem in TDA research [58]. The function f plays the role of a lens, through which we look at the data, and different lenses provide different insights [4]. The choice of f is typically driven by domain knowledge and the data under consideration. In this study, the data under consideration are very similar to the dataset studied by Nicolau and colleagues [5]. Therefore, we followed similar methods to define the lenses. Our choice of lenses is further justified by the observations from the dimension reduction plots.

The cover V = {Vj}j∈J of f(X) consists of a finite number of open intervals as cover elements. To define V, we use the simple strategy of defining intervals of uniform length and overlap. Adjusting the interval length and the overlap increases or decreases the amount of aggregation provided by the Mapper graph. The optimal choice was made by visually inspecting Mapper graphs over a range of parameter values. The parameters resulting in the most stable structure were selected. Any clustering algorithm can be employed to obtain the cover U. We use the density-based clustering algorithm, DBSCAN [59], which is commonly used in Mapper because it does not require a priori knowledge of the number of clusters. Instead, DBSCAN requires 2 input parameters: the number of samples in a neighborhood for a point to be considered as a core point, and the maximum distance between 2 samples for one to be considered in the neighborhood of the other.

Functional annotation of orthogroups

The correlation between expression values and tissue lens and stress lens values was calculated for each orthogroup. The top 2.5% most positively and negatively correlated orthogroups for each lens were selected to represent the tissue lens or stress lens correlated orthogroups. Arabidopsis gene IDs were used to identify the overlap between the GreenCut2 [16] inventory with Arabidopsis orthologs in our overall set of orthogroups, as well as our sets of tissue lens and stress lens correlated orthogroups. The binom_test() function from SciPy [60] was used to apply one-sided binomial tests to check for enrichment of GreenCut2 loci in the overall, tissue lens, and stress lens correlated orthogroup sets. GO term enrichment of the sets of genes mapped to orthogroups and correlated with the tissue lens or stress lens was done using GOATOOLS [61]. Data on gene function and biochemical reactions associated with specific loci were derived from TAIR [62], KEGG [63], and a genome-scale metabolic model of Arabidopsis metabolism from [64].
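The selection and enrichment steps can be sketched as follows, assuming Pearson correlation (the text does not name the correlation measure). The expression matrix and lens values are simulated, the counts passed to the binomial test are hypothetical, and `binomtest` is used here as the current SciPy replacement for the `binom_test()` function named above.

```python
import numpy as np
from scipy.stats import binomtest


def lens_correlations(expr, lens):
    """Pearson correlation between each row of expr (orthogroups x samples)
    and the lens values (one per sample)."""
    xc = expr - expr.mean(axis=1, keepdims=True)
    lc = lens - lens.mean()
    return (xc @ lc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(lc))


def top_fraction(r, frac=0.025):
    """Indices of the most positively and most negatively correlated rows."""
    k = max(1, int(round(frac * r.size)))
    order = np.argsort(r)
    return order[-k:], order[:k]


# Simulated data: 400 orthogroups, 60 samples, with two planted signals.
rng = np.random.default_rng(2)
lens = rng.normal(size=60)
expr = rng.normal(size=(400, 60))
expr[0] = lens + 0.1 * rng.normal(size=60)     # strongly positive
expr[1] = -lens + 0.1 * rng.normal(size=60)    # strongly negative

r = lens_correlations(expr, lens)
top_pos, top_neg = top_fraction(r)

# Hypothetical enrichment check: 8 of the selected orthogroups are in a
# reference inventory whose background proportion is assumed to be 0.2.
result = binomtest(k=8, n=top_pos.size, p=0.2, alternative="greater")
print(top_pos.size, result.pvalue)
```

Setting `alternative="less"` gives the one-sided test in the opposite direction, as would be needed for a set hypothesized to be depleted rather than enriched.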

Supporting information

S1 Text

Fig A. Histogram of 3-way factors of the RNAseq samples before and after downsampling. The distribution of 3-way factors for family, tissue, and stress is plotted. The 16 families, 8 tissue types, and 10 stresses equate to 1,280 unique 3-way combinations, but we only observed 195 unique combinations in our dataset. The distribution of samples from the entire dataset is shown on the left, and the distribution of samples when downsampling the 30 most common 3-way combinations is shown on the right. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code can be found in https://zenodo.org/records/8428609 [65].

Fig B. Factor-wise frequency plots of RNAseq samples before and after subsampling. The number of samples in each family, tissue type, or stress is plotted before (top) and after (bottom) subsampling. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code can be found in https://zenodo.org/records/8428609 [65].

Fig C. Topology of Mapper graphs generated from the subsampled data. Samples from each node in the Mapper graph are colored by plant family (A), stress (B), or tissue type (C), using the subsampled data. The overall topology and sample distribution are similar to the Mapper graphs constructed with the full, unbalanced dataset, suggesting that sample distribution is not a major factor in our analyses.

Fig D. Linear regression analysis of association of surrogate variables to one batch variable (BioProject), our biological variables of interest (stress, tissue, and family), and their pairwise interactions. All surrogate variables were regressed on either each variable or interaction individually to calculate adjusted R² values.

Table A. Enrichment of GreenCut2 genes in orthogroup-mapped Arabidopsis thaliana genes and stress-/tissue-correlated orthogroup-mapped genes. The proportion of GreenCut2 genes in all the orthogroups used in this study was compared against the proportion of GreenCut2 genes in a list of all A. thaliana genes using a one-sided binomial test. The proportion of tissue lens and stress lens correlated orthogroup-mapped genes in GreenCut2 was compared against the proportion of GreenCut2 genes in the entire set of orthogroup-mapped genes using one-sided binomial tests. Tissue-correlated genes were hypothesized to be more likely to be in GreenCut2 than a random selection of orthogroup-mapped genes, and the stress-correlated genes were hypothesized to be less likely.

(DOCX)

S1 Dataset. GO term enrichment results on genes negatively correlated with the tissue lens.

(XLSX)

S2 Dataset. GO term enrichment results on genes positively correlated with the tissue lens.

(XLSX)

S3 Dataset. GO term enrichment results on genes negatively correlated with the stress lens.

(XLSX)

S4 Dataset. GO term enrichment results on genes positively correlated with the stress lens.

(XLSX)

S5 Dataset. Overlap between orthogroup-mapped genes and tissue lens and stress lens correlated genes with the GreenCut2 resource (Karpowicz).

(XLSX)

S6 Dataset. Metadata of the raw data used in this experiment.

(CSV)

S7 Dataset. Expression matrix of TPMs for the normalized orthogroups.

(CSV)

Abbreviations

GO

gene ontology

PCA

principal component analysis

SRA

sequence read archive

SVA

surrogate variable analysis

TDA

topological data analysis

TPM

transcript per million

t-SNE

t-distributed stochastic neighbor embedding

Data Availability

The code, metadata, and raw datasets from this project are available on a dedicated GitHub page: https://github.com/PlantsAndPython/plant-evo-mapper and Zenodo: https://zenodo.org/records/8428609

Funding Statement

This work was funded primarily by a National Science Foundation Research Traineeship training grant (NSF 1828149 to AMT, DHC, and RV), which established the Integrated training Model in Plant And Compu-Tational Sciences (IMPACTS) program at Michigan State University. This grant funded fellows within this program (JAMK, MDR, KSA, CC, JD, RD, TBJ, HRJ, AM, EMR, AMS, JY) as well as the project-based curriculum for the Plants and Python Course that formed the backbone of this manuscript. This work is also supported by NSF Plant Genome Research Program awards IOS-2310355 to EM, DHC, and RV, IOS-2310356 to AH, and IOS-2310357 to AK, NSF Developmental Mechanisms award IOS-2039489 to AH, and NSF Biological Integration Institute award (DBI-2213983 to RV). Several students (JAMK, MDR, KSA, HMP, JP) were supported by a predoctoral training award (T32-GM110523 to RV) from the National Institute of General Medical Sciences of the NIH. This project was supported by the USDA National Institute of Food and Agriculture, and by Michigan State University AgBioResearch to AMT, DHC, and RV. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Lim PK, Zheng X, Goh JC, Mutwil M. Exploiting plant transcriptomic databases: Resources, tools, and approaches. Plant Commun. 2022;3:100323. doi: 10.1016/j.xplc.2022.100323
2. Washburn JD, Mejia-Guerra MK, Ramstein G, Kremling KA, Valluru R, Buckler ES, et al. Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. Proc Natl Acad Sci U S A. 2019;116:5542–5549. doi: 10.1073/pnas.1814551116
3. Azodi CB, Pardo J, VanBuren R, de Los CG, Shiu S-H. Transcriptome-Based Prediction of Complex Traits in Maize. Plant Cell. 2020;32:139–151. doi: 10.1105/tpc.19.00332
4. Singh G, Mémoli F, Carlsson G. Topological methods for the analysis of high dimensional data sets and 3d object recognition. PBG@ Eurographics. doi: 10.2312/spbg.spbg07.091-100/091-100
5. Nicolau M, Levine AJ, Carlsson G. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc Natl Acad Sci U S A. 2011;108:7265–7270. doi: 10.1073/pnas.1102826108
6. Rizvi AH, Camara PG, Kandror EK, Roberts TJ, Schieren I, Maniatis T, et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nat Biotechnol. 2017;35:551–560. doi: 10.1038/nbt.3854
7. Proost S, Mutwil M. CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Res. 2018;46:W133–W140. doi: 10.1093/nar/gky336
8. Julca I, Ferrari C, Flores-Tornero M, Proost S, Lindner A-C, Hackenberg D, et al. Comparative transcriptomic analysis reveals conserved programmes underpinning organogenesis and reproduction in land plants. Nat Plants. 2021:1143–1159. doi: 10.1038/s41477-021-00958-2
9. Zhang H, Zhang F, Feng L, Jia J, Zhai J. A comprehensive online database for exploring ~20,000 public Arabidopsis RNA-Seq libraries. doi: 10.1101/844522
10. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2
11. Pearson K. On lines and planes of closest fit to systems of points in space. Lond Edinb Dubl Phil Mag J Sci. 1901;2:559–572.
12. van der Maaten L, Hinton G. Visualizing Data Using t-SNE. J Mach Learn Res. November/2008. Available: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf?fbcl
13. Tauzin G, Lupo U, Tunstall L, Pérez JB, Caorsi M. giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration. J Mach Learn Res. Available: https://www.jmlr.org/papers/volume22/20-325/20-325.pdf
14. Pathak S, Agarwal A, Ankita A, Gurve MK. Restricted Randomness DBSCAN: A faster DBSCAN Algorithm. 2021 Thirteenth International Conference on Contemporary Computing (IC3-2021). 2021. doi: 10.1145/3474124.3474204
15. Carrière M, Oudot S. Structure and stability of the one-dimensional mapper. Found Comput Math. 2018;18:1333–1396.
16. Karpowicz SJ, Prochnik SE, Grossman AR, Merchant SS. The GreenCut2 resource, a phylogenomically derived inventory of proteins specific to the plant lineage. J Biol Chem. 2011;286:21427–21439. doi: 10.1074/jbc.M111.233734
17. Andersson J, Walters RG, Horton P, Jansson S. Antisense inhibition of the photosynthetic antenna proteins CP29 and CP26: implications for the mechanism of protective energy dissipation. Plant Cell. 2001;13:1193–1204. doi: 10.1105/tpc.13.5.1193
18. Meguro M, Ito H, Takabayashi A, Tanaka R, Tanaka A. Identification of the 7-Hydroxymethyl Chlorophyll a Reductase of the Chlorophyll Cycle in Arabidopsis. Plant Cell. 2011:3442–3453. doi: 10.1105/tpc.111.089714
19. Murray DL, Kohorn BD. Chloroplasts of Arabidopsis thaliana homozygous for the ch-1 locus lack chlorophyll b, lack stable LHCPII and have stacked thylakoids. Plant Mol Biol. 1991;16:71–79. doi: 10.1007/BF00017918
20. Schubert M, Petersson UA, Haas BJ, Funk C, Schröder WP, Kieselbach T. Proteome map of the chloroplast lumen of Arabidopsis thaliana. J Biol Chem. 2002;277:8354–8365. doi: 10.1074/jbc.M108575200
21. Albus CA, Ruf S, Schöttler MA, Lein W, Kehr J, Bock R. Y3IP1, a nucleus-encoded thylakoid protein, cooperates with the plastid-encoded Ycf3 protein in photosystem I assembly of tobacco and Arabidopsis. Plant Cell. 2010;22:2838–2855. doi: 10.1105/tpc.110.073908
22. Xiao J, Li J, Ouyang M, Yun T, He B, Ji D, et al. DAC Is Involved in the Accumulation of the Cytochrome b 6/f Complex in Arabidopsis. Plant Physiol. 2012:1911–1922. doi: 10.1104/pp.112.204891
23. Harmon AC, Gribskov M, Gubrium E, Harper JF. The CDPK superfamily of protein kinases. New Phytol. 2001;151:175–183. doi: 10.1046/j.1469-8137.2001.00171.x
24. Kruft V, Eubel H, Jänsch L, Werhahn W, Braun HP. Proteomic approach to identify novel mitochondrial proteins in Arabidopsis. Plant Physiol. 2001;127:1694–1710.
25. Millar AH, Sweetlove LJ, Giegé P, Leaver CJ. Analysis of the Arabidopsis mitochondrial proteome. Plant Physiol. 2001;127:1711–1727.
26. Menges M, Hennig L, Gruissem W, Murray JAH. Cell cycle-regulated gene expression in Arabidopsis. J Biol Chem. 2002;277:41987–42002. doi: 10.1074/jbc.M207570200
27. Wang C, Wang H, Zhang J, Chen S. A seed-specific AP2-domain transcription factor from soybean plays a certain role in regulation of seed germination. Sci China C Life Sci. 2008;51:336–345. doi: 10.1007/s11427-008-0044-6
28. Léon-Kloosterziel KM, van de Bunt GA, Zeevaart JA, Koornneef M. Arabidopsis mutants with a reduced seed dormancy. Plant Physiol. 1996;110:233–240. doi: 10.1104/pp.110.1.233
29. Han S, Green L, Schnell DJ. The signal peptide peptidase is required for pollen function in Arabidopsis. Plant Physiol. 2009;149:1289–1301. doi: 10.1104/pp.108.130252
30. Zhou J-J, Liang Y, Niu Q-K, Chen L-Q, Zhang X-Q, Ye D. The Arabidopsis general transcription factor TFIIB1 (AtTFIIB1) is required for pollen tube growth and endosperm development. J Exp Bot. 2013;64:2205–2218. doi: 10.1093/jxb/ert078
31. Schilmiller AL, Koo AJK, Howe GA. Functional diversification of acyl-coenzyme A oxidases in jasmonic acid biosynthesis and action. Plant Physiol. 2007;143:812–824. doi: 10.1104/pp.106.092916
32. Staswick PE, Tiryaki I. The oxylipin signal jasmonic acid is activated by an enzyme that conjugates it to isoleucine in Arabidopsis. Plant Cell. 2004;16:2117–2127. doi: 10.1105/tpc.104.023549
33. Lisenbee CS, Lingard MJ, Trelease RN. Arabidopsis peroxisomes possess functionally redundant membrane and matrix isoforms of monodehydroascorbate reductase. Plant J. 2005;43:900–914. doi: 10.1111/j.1365-313X.2005.02503.x
34. Kim HJ, Oh SA, Brownfield L, Hong SH, Ryu H, Hwang I, et al. Control of plant germline proliferation by SCF(FBL17) degradation of cell cycle inhibitors. Nature. 2008;455:1134–1137. doi: 10.1038/nature07289
35. Masuda HP, Ramos GBA, de Almeida-Engler J, Cabral LM, Coqueiro VM, Macrini CMT, et al. Genome based identification and analysis of the pre-replicative complex of Arabidopsis thaliana. FEBS Lett. 2004;574:192–202. doi: 10.1016/j.febslet.2004.07.088
36. Collinge MA, Spillane C, Köhler C, Gheyselinck J, Grossniklaus U. Genetic interaction of an origin recognition complex subunit and the Polycomb group gene MEDEA during seed development. Plant Cell. 2004;16:1035–1046. doi: 10.1105/tpc.019059
37. Ogura Y, Shibata F, Sato H, Murata M. Characterization of a CENP-C homolog in Arabidopsis thaliana. Genes Genet Syst. 2004;79:139–144. doi: 10.1266/ggs.79.139
38. Kaya H, Shibahara KI, Taoka KI, Iwabuchi M, Stillman B, Araki T. FASCIATA genes for chromatin assembly factor-1 in arabidopsis maintain the cellular organization of apical meristems. Cell. 2001;104:131–142. doi: 10.1016/s0092-8674(01)00197-0
39. Dou X-Y, Yang K-Z, Ma Z-X, Chen L-Q, Zhang X-Q, Bai J-R, et al. AtTMEM18 plays important roles in pollen tube and vegetative growth in Arabidopsis. J Integr Plant Biol. 2016;58:679–692. doi: 10.1111/jipb.12459
40. Broadhvest J, Baker SC, Gasser CS. SHORT INTEGUMENTS 2 promotes growth during Arabidopsis reproductive development. Genetics. 2000;155:899–907. doi: 10.1093/genetics/155.2.899
41. Liu H-H, Xiong F, Duan C-Y, Wu Y-N, Zhang Y, Li S. Importin β4 Mediates Nuclear Import of GRF-Interacting Factors to Control Ovule Development in Arabidopsis. Plant Physiol. 2019:1080–1092. doi: 10.1104/pp.18.01135
42. Huang B, Qian P, Gao N, Shen J, Hou S. Fackel interacts with gibberellic acid signaling and vernalization to mediate flowering in Arabidopsis. Planta. 2017;245:939–950. doi: 10.1007/s00425-017-2652-5
43. Rabadán R, Mohamedi Y, Rubin U, Chu T, Alghalith AN, Elliott O, et al. Identification of relevant genetic alterations in cancer using topological data analysis. Nat Commun. 2020;11:3808. doi: 10.1038/s41467-020-17659-7
44. Mandal S, Guzmán-Sáenz A, Haiminen N, Basu S, Parida L. A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data. Algorithms for Computational Biology. Springer International Publishing; 2020, pp. 178–187.
45. Li M, An H, Angelovici R, Bagaza C, Batushansky A, Clark L, et al. Topological Data Analysis as a Morphometric Method: Using Persistent Homology to Demarcate a Leaf Morphospace. Front Plant Sci. 2018;9:553. doi: 10.3389/fpls.2018.00553
46. Amézquita EJ, Quigley MY, Ophelders T, Landis JB, Koenig D, Munch E, et al. Measuring hidden phenotype: quantifying the shape of barley seeds using the Euler characteristic transform. in silico Plants. 2021;4:diab033.
47. Zeng D, Li M, Jiang N, Ju Y, Schreiber H, Chambers E, et al. TopoRoot: a method for computing hierarchy and fine-grained traits of maize roots from 3D imaging. Plant Methods. 2021. doi: 10.1186/s13007-021-00829-z
48. Rejeb IB, Pastor V, Mauch-Mani B. Plant Responses to Simultaneous Biotic and Abiotic Stress: Molecular Mechanisms. Plants. 2014;3:458–475. doi: 10.3390/plants3040458
49. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–i890. doi: 10.1093/bioinformatics/bty560
50. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197
51. Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 2015;4:1521. doi: 10.12688/f1000research.7563.2
52. Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:D1178–D1186. doi: 10.1093/nar/gkr944
53. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–368. doi: 10.1038/s41592-021-01101-x
54. Pedregosa F, Varoquaux G, Gramfort A. Scikit-learn: Machine learning in Python. J Mach Learn. 2011. Available from: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf?ref=https://githubhelp.com
55. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:1724–1735. doi: 10.1371/journal.pgen.0030161
56. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–883. doi: 10.1093/bioinformatics/bts034
57. Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014;42:e161. doi: 10.1093/nar/gku864
58. Chalapathi N, Zhou Y, Wang B. Adaptive Covers for Mapper Graphs Using Information Criteria. 2021 IEEE International Conference on Big Data (Big Data). ieeexplore.ieee.org; 2021, pp. 3789–3800.
59. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD. Available from: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf?source=post_page
60. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2
61. Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, et al. GOATOOLS: A Python library for Gene Ontology analyses. Sci Rep. 2018;8:10872. doi: 10.1038/s41598-018-28948-z
62. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–D1210. doi: 10.1093/nar/gkr1090
63. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27
64. Gomes de Oliveira C, Quek L-E, Saa PA, Nielsen LK. A multi-tissue genome-scale metabolic modeling framework for the analysis of whole plant systems. Front Plant Sci. 2015;6:4. doi: 10.3389/fpls.2015.00004
65. Palande S. PlantsAndPython/plant-evo-mapper: plant-evo-mapper-first-release. 2023. doi: 10.5281/zenodo.8428609

Decision Letter 0

Ines Alvarez-Garcia

31 Jan 2023

Dear Dr VanBuren,

Thank you for submitting your manuscript entitled "The topological shape of gene expression across the evolution of flowering plants" for consideration as a Research Article by PLOS Biology. Thank you also for your patience as we completed our editorial process, and please accept my apologies for the delay in providing you with our decision.

Your manuscript has now been evaluated by the PLOS Biology editorial staff as well as by an academic editor with relevant expertise and I am writing to let you know that we would like to send your submission out for external peer review. However, I should note that the outcome of our discussion of your manuscript is that we have some reservations as to the overall strength of novel biological insight offered by your data. We would need to be persuaded by the reviewers that the paper has the potential to offer the significant strength of advance that we require for publication in order to pursue it further for PLOS Biology.

Before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. After your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by Feb 02 2023 11:59PM.

If your manuscript has been previously peer-reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like us to consider previous reviewer reports, please edit your cover letter to let us know and include the name of the journal where the work was previously considered and the manuscript ID it was given. In addition, please upload a response to the reviews as a 'Prior Peer Review' file type, which should include the reports in full and a point-by-point reply detailing how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

ialvarez-garcia@plos.org

Decision Letter 1

Ines Alvarez-Garcia

5 Apr 2023

Dear Dr VanBuren,

Thank you for your patience while your manuscript entitled "The topological shape of gene expression across the evolution of flowering plants" was peer-reviewed at PLOS Biology. Please also accept my sincere apologies for the delay in providing you with our decision. The manuscript has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

The reviews are attached below. As you will see, the reviewers find the method and conclusions of the manuscript novel and important for the field, but they also raise several concerns that would need to be addressed before we can consider the manuscript for publication. The reviewers think that you should state the limitations of the methods in more detail and address other methodological issues, along with other points that need to be clarified.

In light of the reviews and discussions with the Academic Editor and the rest of the team, we would like to invite you to revise the work to thoroughly address the reviewers' reports. You should address all the points related to the methodological aspects of the paper raised by Reviewers 1 and 2, but we don't think it is necessary to look at other embedding algorithms (UMAP) or to present different PCs/tSNE dimensions, as suggested by Reviewer 2. In addition, you should change the article type to Methods and Resources when you submit the revision, as we do think the article would fit that format better.

Given the extent of revision needed, we cannot make a decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is likely to be sent for further evaluation by all or a subset of the reviewers.

We expect to receive your revised manuscript within 3 months. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension.

At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw it.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted" file type.

3. Resubmission Checklist

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision and fulfil the editorial requests:

a) *PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). Please also indicate in each figure legend where the data can be found. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

b) *Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

c) *Blurb*

Please also provide a blurb which (if accepted) will be included in our weekly and monthly Electronic Table of Contents, sent out to readers of PLOS Biology, and may be used to promote your article in social media. The blurb should be about 30-40 words long and is subject to editorial changes. It should, without exaggeration, entice people to read your manuscript. It should not be redundant with the title and should not contain acronyms or abbreviations. For examples, view our author guidelines: https://journals.plos.org/plosbiology/s/revising-your-manuscript#loc-blurb

d) *Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

ialvarez-garcia@plos.org

------------------------------------

Reviewers' comments

Rev. 1:

In the manuscript entitled, "The topological shape of gene expression across the evolution of flowering plants", Palande et al. present an innovative new analysis method for identifying trends in large collections of data. They take transcriptomes of different tissues and under different stresses from species across the phylogeny of flowering plants and analyze them all together with topological data analysis. They represent the shape of gene expression across a tissue lens and find a continuum from leaves to seeds with some branches for specific tissues in various plant families. The genes underlying the difference in tissues generally relate to photosynthesis in leaves versus core metabolic processes such as ubiquitination and Golgi vesicle-mediated transport in non-leaf tissues. Their work on stress is less conclusive, as healthy and stressed samples are often mixed at nodes. However, they did identify a general trend from healthy to stressed. They also found a negative correlation between stressed tissues and growth/cell division, which is a well-known trade-off plants make, suggesting fundamental biology can be accessed by these analysis methods. To my knowledge this is the first use of topological data analysis to analyze expression patterns in plants, and there are only a handful of uses in all of biology. Overall, the method is novel and adds a key new tool to analyze complex high dimensional data, particularly across species.

Major concerns

1. We ask the authors to consider and further discuss possible limitations of their method in more detail. Specifically, how does the proportion of samples in the input data affect the weight of output terms? For example, the dataset analyzed has high numbers of samples with drought, heat, and salt stress, which may explain why water deprivation comes up as a major output term for stress response overall. Something else might come up if stress samples were more balanced. Likewise, because Poaceae are a large fraction of the sample species input, they might play a big part in shaping the stress outputs. It would be worth re-running the analysis without Poaceae to see what effect that has on the output.

2. Similarly, the authors restrict the analysis to 6,328 orthologous low copy genes that are conserved across the angiosperm species used. This is the only way to do the analysis, but it eliminates a lot of important genes from the analysis. For instance, most of the transcription factors and signaling pathway components controlling development are members of large protein families that are excluded from this analysis. This undoubtedly tilts the tissue results toward more core biological functions such as photosynthesis, ubiquitination, endomembrane trafficking, etc., instead of classic developmental regulators. The authors should discuss how this limitation of genes included in the analysis shapes and limits the outputs.

3. This paper is introducing topological data analysis to most readers for the first time, so it is imperative to create a better schematic diagram explaining the analysis method to the reader. Figure 2a and 2b are not sufficient. The text lines ~167-172 describes a step wise process to arrive at the final shape. It would be worth creating a detailed step by step diagram illustrating the analysis process that directly relates to the steps described. Also please label the lens function with what it is, not just lens function #1.

Minor comments:

1. Line 54-55 "thousands of diverse plant species spanning over 900 million years of evolution" The sampled plants are all extant plants, so they do not span evolution. Maybe you mean some of the plants have diverged many million years ago?

2. What is in the "Other" category for tissue and stress?

3. Supplemental datasets 6 (mentioned in line 355) and 7 (mentioned in line 380) are missing from the submission (no link to them on the document).

4. For Stress the authors say there are 17 nodes in text, but there are 18 nodes labeled on the figure.

5. Please label which end is the stress end and which is the healthy end on Figure 4.

6. In Figure 2F, what does the box mean? Why isn't there an equivalent box on the tissue side, in Figure 2E?

7. Please check your color scheme for colorblind friendliness. Even for full color vision people, the pie charts in Figure 4 can be hard to see.

8. Figure 3 seems to be mis-titled as stress lens, and is the tissue lens.

Rev. 2:

Palande et al report on a meta-analysis of transcriptomic data throughout the angiosperms. By simplifying highly dimensional gene expression data from many plant species with a topological simplification tool, Mapper, the authors have identified global features in the data reflecting developmental, taxonomic and environmental variables.

The approach is very original and the conceptual framework seems quite promising for synthesizing important features in the data. In principle, visualizations using dimensionality reduction can generate an ensemble of shapes, and as such interpreting these shapes could be misleading or even meaningless - I think this point should be emphasized in the introduction. How meaningful is the shape - what is the biological significance of having "loops, branch point, flares" (line 92)?

I think the manuscript should emphasize more clearly what is the advantage of presenting the results with mapper graphs, and what novel conclusions could we draw (the source/sink continuum is apparent on the PC plot). It may be useful to look at other embedding algorithms (UMAP), and present different PCs/tSNE dimensions.

How biased is the dataset to species or stresses? It appears less biased to tissues, where most of the variation is detected. This is data with substantial sparsity and likely very sensitive to sampling.

More generally, I am not sure if gene expression having any "structure" or "shape" is the most appropriate way to phrase the findings of the study. The resulting "shape" is a representation of the gene expression, rather than a guiding force for resulting phenotypes (line 95).

Line 120 - Amborella is not a "basal" angiosperm - please rephrase to "sister of the rest of the angiosperms" (for in-depth discussion, please check Stacey Smith's blog)

Rev. 3:

This study by Palande et al. entitled The topological shape of gene expression across the evolution of flowering plants harnesses the data dimensionality reduction procedure Mapper to derive a topological understanding of gene expression across diverse plant species. The study explores 2,671 SRA samples from 54 diverse plant species which are phylogenetically joined by 6,328 conserved orthogroups. When exploring the differences between tissue expression and stress expression (abiotic vs biotic) the authors interpret their projection and modelling outcomes as distinct and conserved expression patterns across angiosperms that delineate different tissue types or responses to biotic and abiotic stresses based on their agreement with a particular lens-function (e.g. length of residual vector).

While this study provides a promising opportunity to aggregate some of the vast sample space of the NCBI SRA database, it fails to deliver substantiating evidence and presents significant analytical, bioinformatic, and conceptual shortcomings that would need to be addressed in a major revision. If their conclusions still hold true after the sufficient scrutiny of (yet missing) negative controls and cross-sample data normalization, this study could provide value for a specialised audience that seeks to replicate their topological procedure for other SRA samples.

Major comments:

Overall, the two most important resources for reproducibility and study assessment were not provided for review by the authors: The final expression dataset (Suppl Data 7) and the GitHub repository (Data availability: The code, metadata, and raw datasets from this project are available on a dedicated GitHub page: <empty space>).

Methodological shortcomings:

- Assembly of gene expression data: While the intuition to generate TPM values based on a standardized pipeline is correct, the lack of any normalization across either tissues or across species or across stress conditions or across similar data quality is making the procedure of deriving comparable expression level estimates highly biased and insufficient. Without sufficient normalization and accounting for sequencing depth, coverage, sequencing technology conformity, contamination checks, batch effect correction, etc any observed topological pattern could be explained by these technical artifacts. Since none of these quality-control steps were performed and no controls to suggest otherwise were presented, I am not convinced that the presented topologies represent true biological signature. I have to assume that these patterns are largely driven by differences in their technical treatment and highly diverse experimental designs.

- Mathematical basis of topological data analysis: The authors present the Mapper approach as the only alternative compared to established dimensionality reduction methods that was able to present them with patterns they felt comfortable to interpret (p. 8 "We first tested traditional dimensionality reduction and clustering-based approaches but found they were largely ineffective and unable to clearly resolve samples. Instead, we used a novel topological framework to compare samples and test for evolutionary conservation."). How can the authors ensure that traditional dimensionality reduction and clustering-based approaches didn't fail, because they (in fact intrinsically) captured the bias of the non-standardized data? How would the authors argue against the argument that they cherry-picked a method that allowed them to show any pattern? The Mapper approach is also known to have analogous weaknesses as traditional dimensionality reduction methods such as determining an appropriate and agnostic number of overlapping intervals used to cover the data and the robustness of diverse clustering algorithm (and distance metric) able to consistently group similar points together. I interpret the claim "We experimented with a range of value lengths of the intervals and the size of the overlap to identify the values that produced relatively stable mapper graphs." as visually/manually cherry-picking the most favourable topologies rather than relying on a robust and objective metric.

The fact that furthermore "The clustering was performed using DBSCAN, a commonly used clustering algorithm in Mapper [14]." without demonstrating the robustness of topologies to differences in clustering methods further strengthens my suspicion. The claim that other studies also used the Mapper approach on gene expression data despite the open problem in TDA research to optimize Mapper parameters is not convincing at this stage where traditional dimensionality reduction methods fail.

- For the claim that most of the community assumes that biotic and abiotic stress are evolutionarily independent (not clear to me what this means), no reference is presented and their finding that biotic and abiotic stress samples (therefore unexpectedly) showed similar topologies/patterns/distributions further illustrates the importance to standardize and normalize across SRA samples, tissues, species etc. The fact that biotic and abiotic stress samples show similar patterns/topologies may simply be the result of the fact that most stress responses involve a much smaller number of genes than tissue development or housekeeping and thus technical artifacts are more pronounced under these (tissue, housekeeping, non-stress) conditions. Since no control experiments are provided, I am not convinced that these stress similarities are in fact true biological signatures.
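The parameter-sensitivity concern raised in the Mapper comments above can be made concrete with a toy example. The following is a minimal sketch only, not the authors' pipeline: the function names (`make_cover`, `cluster`, `mapper`), the circle dataset, and all parameter values are hypothetical, and the clustering step is a single-linkage stand-in that behaves like DBSCAN with `min_samples=1` rather than the full DBSCAN algorithm. It covers the lens range with overlapping intervals, clusters the points falling in each interval, and links clusters that share points.

```python
import math
from itertools import combinations

def make_cover(lo, hi, n_intervals, overlap):
    """Equal-length intervals covering [lo, hi]; neighbours share a
    fraction `overlap` (between 0 and 1) of their length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def cluster(indices, points, eps):
    """Connected components of the eps-neighbourhood graph
    (a simplified stand-in for DBSCAN, assumed for illustration)."""
    remaining, clusters = set(indices), []
    while remaining:
        seed = remaining.pop()
        stack, comp = [seed], {seed}
        while stack:
            p = stack.pop()
            near = {q for q in remaining
                    if math.dist(points[p], points[q]) <= eps}
            remaining -= near
            comp |= near
            stack.extend(near)
        clusters.append(frozenset(comp))
    return clusters

def mapper(points, lens, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters of point indices,
    edges join clusters that have at least one point in common."""
    values = [lens(p) for p in points]
    nodes = []
    for a, b in make_cover(min(values), max(values), n_intervals, overlap):
        # Small tolerance guards against floating-point error at the ends.
        members = [i for i, v in enumerate(values) if a - 1e-9 <= v <= b + 1e-9]
        nodes.extend(cluster(members, points, eps))
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges

# Toy data: 40 evenly spaced points on a unit circle, lens = x-coordinate.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]

for n_intervals in (2, 4):
    nodes, edges = mapper(circle, lambda p: p[0], n_intervals, 0.3, 0.3)
    print(n_intervals, "intervals:", len(nodes), "nodes,", len(edges), "edges")
```

On this toy dataset the graph built with four intervals keeps the loop of the circle, while the graph built with two intervals collapses it to a single edge, which is the kind of dependence on cover parameters the reviewer asks the authors to quantify.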

Decision Letter 2

Ines Alvarez-Garcia

16 Aug 2023

Dear Dr VanBuren,

Thank you for your patience while we considered your revised manuscript entitled "The topological shape of gene expression across the evolution of flowering plants" for publication as a Research Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors, the Academic Editor and the original reviewers.

The reviews are attached below. Based on these comments, we are likely to accept this manuscript for publication, provided you address the remaining points raised by Reviewer 3. In addition, please address the data and other policy-related requests stated below.

In addition, we would like you to consider a suggestion to improve the title:

"Topological data analysis across the evolution of flowering plants reveals a core gene expression backbone that defines plant form and function"

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

ialvarez-garcia@plos.org

------------------------------------------------------------------------

DATA POLICY: IMPORTANT - PLEASE READ

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it:

Fig. 1B-E; Fig. 2F, G; Fig. S1 and Fig. S2

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on WHERE THE UNDERLYING DATA CAN BE FOUND, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

**Please also provide at this stage all the accession numbers and DOIs.

------------------------------------------------------------------------

BLURB

Please also provide a blurb which (if accepted) will be included in our weekly and monthly Electronic Table of Contents, sent out to readers of PLOS Biology, and may be used to promote your article in social media. The blurb should be about 30-40 words long and is subject to editorial changes. It should, without exaggeration, entice people to read your manuscript. It should not be redundant with the title and should not contain acronyms or abbreviations. For examples, view our author guidelines: https://journals.plos.org/plosbiology/s/revising-your-manuscript#loc-blurb

------------------------------------------------------------------------

Reviewers' comments

Rev. 1:

The authors have carefully and thoroughly addressed all of the points we have raised. Their further analysis has shown the robustness of their method to the imperfect distribution of public datasets. We commend the authors on this innovative work.

Rev. 2:

Thank you for addressing my comments - the manuscript is ready for publication.

Rev. 3:

I appreciate the authors' efforts to address my concerns. Some of my concerns were successfully approached and are now resolved, but a few major concerns remain unresolved which need further attention. I therefore recommend to address these concerns in full in a minor revision.

Rev. 3 (previous comments):

This study by Palande et al. entitled The topological shape of gene expression across the evolution of flowering plants harnesses the data dimensionality reduction procedure Mapper to derive a topological understanding of gene expression across diverse plant species. The study explores 2,671 SRA samples from 54 diverse plant species which are phylogenetically joined by 6,328 conserved orthogroups. When exploring the differences between tissue expression and stress expression (abiotic vs biotic) the authors interpret their projection and modelling outcomes as distinct and conserved expression patterns across angiosperms that delineate different tissue types or responses to biotic and abiotic stresses based on their agreement with a particular lens-function (e.g. length of residual vector).

While this study provides a promising opportunity to aggregate some of the vast sample space of the NCBI SRA database, it fails to deliver substantiating evidence and presents significant analytical, bioinformatic, and conceptual shortcomings that would need to be addressed in a major revision. If their conclusions still hold true after the sufficient scrutiny of (yet missing) negative controls and cross-sample data normalization, this study could provide value for a specialised audience that seeks to replicate their topological procedure for other SRA samples.

Major comments:

Overall, the two most important resources for reproducibility and study assessment were not provided for review by the authors: The final expression dataset (Suppl Data 7) and the GitHub repository (Data availability: The code, metadata, and raw datasets from this project are available on a dedicated GitHub page: <empty space>).

Authors: We apologize for the oversight here. Supplemental Dataset 7 is now available on GitHub and in the revision. This file is relatively large (~200 Mb), and we were not able to upload it in the first submission. The GitHub link was corrected, and is now active as well: https://github.com/PlantsAndPython/plant-evo-mapper

Methodological shortcomings:

- Assembly of gene expression data: While the intuition to generate TPM values based on a standardized pipeline is correct, the lack of any normalization across tissues, across species, across stress conditions, or across samples of similar data quality makes the procedure of deriving comparable expression level estimates highly biased and insufficient. Without sufficient normalization and accounting for sequencing depth, coverage, sequencing technology conformity, contamination checks, batch effect correction, etc., any observed topological pattern could be explained by these technical artifacts. Since none of these quality-control steps were performed and no controls to suggest otherwise were presented, I am not convinced that the presented topologies represent a true biological signature. I have to assume that these patterns are largely driven by differences in their technical treatment and highly diverse experimental designs.

Authors: This is an important point and we agree that standardization across the experiment is essential. We feel our analyses are highly standardized, statistically robust, and largely free of technical artifacts beyond the inherent noise of RNAseq data.

Rev-3: While I appreciate the authors' confidence in the SRA database, would it be possible to specifically state how normalization ACROSS diverse SRA samples was performed, to provide concrete evidence for their strong claim that "[…] our analyses are highly standardized, statistically robust, and largely free of technical artifacts beyond the inherent noise of RNAseq data"? Did the authors rely on the "SRA Normalized Format" (default, see e.g. https://www.ncbi.nlm.nih.gov/sra/docs/data-format-faq/) as input to their analyses? If yes, did the authors check whether the samples giving the strongest biological signature (especially the ones not captured by alternative dimensionality reduction approaches) still have a high "original quality score" in the non-SRA normalized samples (see FAQs in the previous link)? This important analysis detail should be made clear in the manuscript.

Authors: Below we include a detailed summary of the various QC metrics we used and clarified this in text (including a new supplemental note about surrogate variable analysis).

We used Surrogate Variable Analysis (SVA) (Leek et al. 2012) to explore the effects of confounding technical variables on the publicly available SRA data assembled for this study. Briefly, we identified three primary variables of interest (tissue, stress, and family), which were fixed in the model used to estimate "surrogate variables" to minimize the amount of variability attributable to these primary variables captured by the estimated surrogate variables (see Supplementary Methods for Surrogate Variable Analysis). These surrogate variables represent unaccounted for technical variables impacting the dataset. Due to the breadth of families, stresses, and tissues analyzed, we do not have a full factorial design (i.e., there are combinations of family, stress, and tissue factor values for which there are no expression datasets). Because of this, SVA would remove variability due to our primary variables and their interactions. To get a sense of what kind of impact the surrogate variables might have on the dataset when removed, we estimated the correlation between the first order interactions between our primary variables and the surrogate variables identified by SVA. We identified 24 surrogate variables which individually captured between 53% and 98% of variation between BioProjects (Supplemental Figure 4). We also estimated the interaction terms between the tissue, family, and stress factor combinations that were present in the dataset and estimated how much of their variation was getting captured by the surrogate variables. Individual surrogate variables captured up to 14% of variation between stress conditions, up to 66% of variation between tissue conditions, and up to 63% of variation between families. 
For the interaction terms between primary variables, individual surrogate variables captured up to 83% of the variation between tissue and family combinations, up to 65% of the variation between stress and family combinations, and up to 71% of the variation between tissue and stress combinations. This suggests that even though stress, tissue, and family are treated as protected primary variables, there are underlying latent variables related to our primary variables and their interactions that may be important sources of biological variation being captured by the surrogate variables. Although individual surrogate variables could be selectively accounted for in downstream analyses in such a way that minimizes the removal of biological signal, this would be a highly subjective process. Moreover, due to our inability to precisely calculate the true correlation between our surrogate variables and interaction terms due to the fact that many factor combinations are missing, this would be statistically dubious as well.
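
As an illustration of the regression step described above, the share of a surrogate variable's variation captured by a single factor can be computed as an adjusted R² from a one-way fit. The following is a minimal sketch, not the authors' code: the function name `adjusted_r2`, the sample values, and the labels are all hypothetical, and it does not estimate the surrogate variables themselves.

```python
from collections import defaultdict

def adjusted_r2(values, labels):
    """Adjusted R^2 from regressing `values` on one categorical factor.

    With only a categorical predictor, the least-squares fit is the
    per-group mean, so R^2 = 1 - SS_residual / SS_total.
    """
    n = len(values)
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    k = len(groups)  # number of factor levels
    if n - k <= 0:
        raise ValueError("need more samples than factor levels")
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("values are constant")
    ss_resid = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    r2 = 1 - ss_resid / ss_total
    # k - 1 dummy predictors plus an intercept leave n - k residual
    # degrees of freedom.
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical surrogate variable scores for six samples.
sv = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
tissue = ["leaf", "leaf", "leaf", "root", "root", "root"]
bioproject = ["P1", "P2", "P1", "P2", "P1", "P2"]

r2_tissue = adjusted_r2(sv, tissue)       # high: tissue tracks the scores
r2_project = adjusted_r2(sv, bioproject)  # low: BioProject does not
```

In this toy case the surrogate variable is almost entirely explained by tissue, which is the situation the response describes: removing such a variable would also remove biological signal.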

Because the surrogate variables show substantial linear correlation with our primary variables and their interaction terms, the application of SVA would require eliminating substantial amounts of biological signal. Since the goal of our study is to identify heterogeneous patterns due to stress, tissue, and family within a high-dimensional gene expression dataset, SVA may not be appropriate for us to use. Alternatively, one could potentially minimize the loss of this signal by cherry-picking individual surrogate variables to include in downstream analysis, which would naturally introduce human bias. A third option would be to use an algorithm like ComBat-seq (Zhang, Parmigiani, and Johnson 2020) that relies on explicitly defined batches. This is problematic for the present study because the closest available batch metadata for the studies gathered from SRA is the BioProject ID, which is, at best, a proxy for batches of samples and is not sufficient to assess the technical variability or noise in the data. More broadly, as discussed in Jaffe et al. (2015), such genomic data "cleaning" methods, by their very nature, delimit the observable features of the resulting datasets to those prespecified by the investigator. In our view, this limits their utility for broad exploratory analyses of the kind described in this study. For all the above reasons, we opted to not use SVA, ComBat, or related techniques prior to downstream analyses. These shortcomings also emphasize the need for tools like Mapper that can, as shown in this manuscript, reveal patterns that are amenable to downstream analysis.

Rev-3: Thank you for this detailed assessment of why their input samples are too heterogeneous or too sparsely sampled for sufficient batch effect correction. This was actually one of my major concerns with this study: if classic batch-effect smoothing/removal methods fail, dimensionality reduction will only project this shortcoming into a lower-dimensional space. The authors argue that, because their input data are too sparsely sampled for the various batch effect removal methods, their dimensionality reduction approach can reveal true biological patterns. This argument is not clear to me, since no data or evidence is presented for the claim. The core principle of batch effect removal relies on exploring the nature of the variance in sufficiently sampled data, and so does dimensionality reduction. Thus, both methods should be sensitive to insufficient sampling and data heterogeneity. Do the authors disagree?

Authors: The raw TPM values have some degree of standardization by library size (i.e., number of reads per sample), and we transformed all expression values by Z-score prior to any downstream analyses.

Rev-3: Thank you for clarifying. Was this Z-score standardization performed on the "SRA Normalized Format" (default, see e.g. https://www.ncbi.nlm.nih.gov/sra/docs/data-format-faq/) or on the SRA raw samples (non-SRA normalized samples)? This important analysis detail should be made clear in the manuscript.

Authors: The Z-score enables cross-species comparisons as values within each dataset are normalized to a common scale based on their standard deviation and mean rather than absolute values. Dimensionality reduction (t-SNE and PCA) from the z-score transformed expression in Figure 1 shows a clear separation of different plant tissues across species, suggesting we are identifying real developmental patterns in our dataset and not technological artifacts. Labeling the samples by technology, year the dataset was published, BioProject, or other variables showed no correlation.
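
A minimal sketch of the Z-score transformation mentioned in this response, under the assumption that each orthogroup is standardized across samples (the response does not state along which axis the transformation was applied). The function name `zscore_rows` and the TPM values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_rows(matrix):
    """Z-score each row (assumed here: one orthogroup across samples)."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = pstdev(row)  # population standard deviation
        if sd == 0:
            # A constant row has no variation to scale; map it to zeros.
            scaled.append([0.0] * len(row))
        else:
            scaled.append([(x - mu) / sd for x in row])
    return scaled

# Hypothetical TPM values: rows are orthogroups, columns are samples.
tpm = [
    [10.0, 20.0, 30.0],
    [5.0, 5.0, 5.0],
    [0.0, 100.0, 200.0],
]
z = zscore_rows(tpm)
```

The first and third rows differ tenfold in absolute TPM but receive identical Z-scores, which shows how the transformation places values on a common scale. It does not by itself correct for batch effects.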

Rev-3: Excellent! Would it be possible to add this analysis ("Labeling the samples by technology, year the dataset was published, BioProject, or other variables showed no correlation.") as Supplementary Figures? This should give readers more confidence in the methodology.

Authors: Tissue patterns are quite clear, and there is no reasonable way that technical artifacts could be causing this delineation as any artifacts or variability would be found within all sample types and species and not within a specific factor such as tissue, stress, or species that could create a misleading pattern. Furthermore, GO term analysis identified sets of genes that are consistent with the lens function we were using, such as photosynthetic genes delineating leaves from other tissues and stress responsive genes delineating healthy from unhealthy tissues. If our analyses were picking up on technical artifacts instead of biological patterns, there should be no enriched GO terms that are consistent with our classification.

Rev-3: Thank you for performing this quality control analysis. I agree with the authors that this is promising evidence. Did the authors have a chance to confirm this analysis with a negative control whereby samples with clear signatures of technical artifacts do not pick up enriched GO-terms that would be consistent with their lens function? If this is indeed the case, then readers will appreciate this analysis by placing more confidence in their method.

Authors: We hope this clarifies the concerns raised by this reviewer and showcases the utility of our dataset and approach.

- Mathematical basis of topological data analysis: The authors present the Mapper approach as the only alternative to established dimensionality reduction methods that was able to present them with patterns they felt comfortable interpreting (p. 8 "We first tested traditional dimensionality reduction and clustering-based approaches but found they were largely ineffective and unable to clearly resolve samples. Instead, we used a novel topological framework to compare samples and test for evolutionary conservation."). How can the authors ensure that traditional dimensionality reduction and clustering-based approaches didn't fail because they (in fact intrinsically) captured the bias of the non-standardized data? How would the authors counter the argument that they cherry-picked a method that allowed them to show any pattern? The Mapper approach is also known to have weaknesses analogous to those of traditional dimensionality reduction methods, such as the need to determine an appropriate and agnostic number of overlapping intervals to cover the data, and the dependence on a clustering algorithm (and distance metric) that consistently groups similar points together. I interpret the claim "We experimented with a range of value lengths of the intervals and the size of the overlap to identify the values that produced relatively stable mapper graphs." as visually/manually cherry-picking the most favourable topologies rather than relying on a robust and objective metric.

Authors: We are receptive to the concern that, due to the lack of concordance between dimensionality approaches like PCA and our Mapper graphs and the lack of a robust hyperparameter tuning procedure for TDA, it may appear that we are cherry-picking favorable topologies to present.

Rev-3: Thank you for confirming that TDA lacks a robust hyperparameter tuning procedure. My major concern in this context was not that the authors cherry-picked their results, but that users of their method when broadly employed to various datasets may feel inclined to cherry-pick in the absence of an unbiased hyperparameter tuning procedure. This point needs to be extensively discussed in the main manuscript and clear guidelines presented to avoid (un)intentional cherry-picking by the end user.

Authors: We believe it is reasonable to assume that if the structures we are presenting are either (a) highly sensitive to the exact set of samples in the dataset, or (b) not reproducible using intuitively related, but distinct, lens functions, this would undermine the strength of our results. However, as shown in our responses demonstrating robustness to downsampling and the supplemental data showing results from using roots as a lens, our results are robust in both cases. This gives us confidence in the results we are presenting.

Rev-3: While I agree with this statement and appreciate the analysis, the concern lies rather with the fact that less nuanced input data (as could be explored by users of the TDA method) could demonstrate less robustness (which was never tested by the end user). Thus, this type of "robustness confirmation analysis" should be clearly communicated to the user.

Authors: We tested the most commonly used dimensionality reduction approaches for expression-based datasets, including MDS, t-SNE, and PCA, as well as hierarchical clustering approaches. It is certainly possible that other methodologies could produce similar topologies to our TDA-based analyses, but we are quite skeptical. We agree with the reviewer that these analyses may be picking up technical differences between experiments such as variations in how, when, and which exact tissues were collected, differences in genotype, or variation in the duration or magnitude of stress. This is somewhat supported by our results from the surrogate variable analysis, which was unable to remove these artifacts without removing the variables we were testing (e.g., tissues, species, or stresses).

Rev-3: I greatly appreciate that the authors confirm my major concerns and that established dimensionality reduction methods are not robust to technical or sampling artifacts.

Authors: This is really the core finding of our manuscript: TDA can find hidden biological structure in complex, noisy datasets that is missed by traditional dimensionality reduction.

Rev-3: It remains unclear to me what the actual evidence is that TDA can find hidden biological structure not captured by traditional methods. The only evidence presented is the GO-term analysis that matches the lens function. But for this analysis no control is presented. It is therefore paramount to present this control (see corresponding point above) to have at least some form of evidence for this claim.

Furthermore, the fact that "The clustering was performed using DBSCAN, a commonly used clustering algorithm in Mapper [14]." is stated without demonstrating the robustness of topologies to differences in clustering methods further strengthens my suspicion. The claim that other studies also used the Mapper approach on gene expression data despite the open problem in TDA research to optimize Mapper parameters is not convincing at this stage where traditional dimensionality reduction methods fail.

Authors: While DBSCAN is a commonly used clustering algorithm in Mapper, we acknowledge that demonstrating the robustness of topologies to different clustering methods would be valuable. In our study, we used DBSCAN as it has been shown to perform well in various applications and has been widely adopted in the field. DBSCAN does not require specifying the number of clusters in advance, which rules out testing most other clustering algorithms.

Regarding the claim that other studies have also used the Mapper approach despite the open problem in TDA research to optimize Mapper parameters, we understand your skepticism. However, it is worth noting that the Mapper approach has proven to be effective in capturing topological structures in diverse datasets, including gene expression data. While traditional dimensionality reduction methods may struggle to handle the complexity and non-linearity of gene expression data, Mapper provides a valuable alternative for uncovering meaningful patterns. We did provide some parameter optimization across cover intervals and interval number for both of the lens functions, and these had little to no effect on the backbone Mapper graph, so again, we feel our results are robust.
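
To make the role of the cover parameters discussed above concrete, here is a minimal, self-contained Mapper sketch for a one-dimensional lens. It is not the authors' implementation. All function names, the toy data, and the way intervals are widened are assumptions. Clustering uses connected components at a distance threshold `eps`, which is what DBSCAN reduces to when its minimum-samples parameter is 1; it is a stand-in for the DBSCAN settings used in the study, which are not given here.

```python
from itertools import combinations
from math import dist

def make_cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal intervals, each widened at both ends by
    half of `overlap` times the base width so that neighbours overlap."""
    width = (hi - lo) / n_intervals
    pad = overlap * width / 2
    return [(lo + i * width - pad, lo + (i + 1) * width + pad)
            for i in range(n_intervals)]

def components(indices, points, eps):
    """Group indices into connected components of the eps-neighbour graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Nodes are clusters within each cover interval; two nodes are
    joined by an edge when they share at least one sample."""
    nodes = []
    for a, b in make_cover(min(lens), max(lens), n_intervals, overlap):
        members = [i for i, v in enumerate(lens) if a <= v <= b]
        nodes.extend(components(members, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges

# Toy data: six samples on a line, with the coordinate itself as the lens.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
lens = [p[0] for p in points]
nodes, edges = mapper_graph(points, lens, n_intervals=3, overlap=0.5, eps=1.0)
```

Rerunning `mapper_graph` over a grid of `n_intervals` and `overlap` values and comparing the resulting node and edge sets is one simple way to check whether a backbone is stable, in the spirit of the parameter exploration the authors describe.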

- For the claim that most of the community assumes that biotic and abiotic stress are evolutionarily independent (not clear to me what this means), no reference is presented. Their finding that biotic and abiotic stress samples (therefore unexpectedly) showed similar topologies/patterns/distributions further illustrates the importance of standardizing and normalizing across SRA samples, tissues, species, etc. The fact that biotic and abiotic stress samples show similar patterns/topologies may simply be the result of the fact that most stress responses involve a much smaller number of genes than tissue development or housekeeping and thus technical artifacts are more pronounced under these (tissue, housekeeping, non-stress) conditions. Since no control experiments are provided, I am not convinced that these stress similarities are in fact true biological signatures.

Authors: This is an excellent point. We agree with the reviewer that the claim that biotic and abiotic stress are evolutionarily independent is contentious and unclear, and we have revised the text for clarity and to provide support for our statements. We have modified this section as shown below:

"Abiotic and biotic stress responses have been mostly studied in isolation, but they typically co-occur in natural environments, and they have overlapping signaling, hormonal, and network responses in plants (reviewed in (Rejeb et al. 2014)). The topological shape of gene expression points to a shared set of pathways or perturbations that define if a tissue is healthy or stressed. Environmental stresses broadly disrupt photosynthesis, core metabolic and cellular functions as either a direct response to physical trauma, or in preparation for defense or resilience. These changes may serve as the backbone of the topological shape we observed for the stress lens."

Most stress responses involve massive transcriptional reprogramming, especially when the stress is severe, and typically thousands of genes have differential expression or regulation under stress compared to control. This includes many genes with roles in steady state processes such as photosynthesis, metabolism, growth, and core cellular characteristics, as well as stress signaling and numerous downstream stress response pathways. Together, this intense shift likely underlies the topological shape we observed. Furthermore, we observed stressed samples from multiple tissues clustering together within the mapper graph and some separation of individual stresses such as high light or cold, suggesting that these signals are far stronger than any technical or background effects.

We disagree that control experiments are not provided as nearly every stress sample has a comparable healthy or control sample and these are used as the foundation for the stress lens function.

Decision Letter 3

Ines Alvarez-Garcia

20 Oct 2023

Dear Dr VanBuren,

Thank you for the submission of your revised Research Article entitled "Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Hajk-Georg Drost, I am delighted to let you know that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Many congratulations and thanks again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

ialvarez-garcia@plos.org

Associated Data


    Supplementary Materials

    S1 Text

    Fig A. Histogram of 3-way factors of the RNAseq samples before and after downsampling. The distribution of 3-way factors for family, tissue, and stress is plotted. The 16 families, 8 tissue types, and 10 stresses equate to 1,280 unique 3-way combinations, but we only observed 195 unique combinations in our dataset. The distribution of samples from the entire dataset is shown on the left, and the distribution of samples when downsampling the 30 most common 3-way combinations is shown on the right. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code can be found in https://zenodo.org/records/8428609 [65].

    Fig B. Factor-wise frequency plots of RNAseq samples before and after subsampling. The number of samples in each family, tissue type, or stress is plotted before (top) and after (bottom) subsampling. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code can be found in https://zenodo.org/records/8428609 [65].

    Fig C. Topology of Mapper graphs generated from the subsampled data. Samples from each node in the Mapper graph are colored by plant family (A), stress (B), or tissue type (C), using the subsampled data. The overall topology and sample distribution are similar to the Mapper graphs constructed with the full, unbalanced dataset, suggesting that sample distribution is not a major factor in our analyses.

    Fig D. Linear regression analysis of association of surrogate variables to one batch variable (BioProject), our biological variables of interest (stress, tissue, and family), and their pairwise interactions. All surrogate variables were regressed on either each variable or interaction individually to calculate adjusted R² values.

    Table A. Enrichment of GreenCut2 genes in orthogroup-mapped Arabidopsis thaliana genes and stress-/tissue-correlated orthogroup-mapped genes. The proportion of GreenCut2 genes in all the orthogroups used in this study was compared against the proportion of GreenCut2 genes in a list of all A. thaliana genes using a one-sided binomial test. The proportion of tissue lens and stress lens correlated orthogroup-mapped genes in GreenCut2 was compared against the proportion of GreenCut2 genes in the entire set of orthogroup-mapped genes using one-sided binomial tests. Tissue-correlated genes were hypothesized to be more likely to be in GreenCut2 than a random selection of orthogroup-mapped genes, and the stress-correlated genes were hypothesized to be less likely.

    (DOCX)

    S1 Dataset. GO term enrichment results on genes negatively correlated with the tissue lens.

    (XLSX)

    S2 Dataset. GO term enrichment results on genes positively correlated with the tissue lens.

    (XLSX)

    S3 Dataset. GO term enrichment results on genes positively correlated with the stress lens.

    (XLSX)

    S4 Dataset. GO term enrichment results on genes positively correlated with the stress lens.

    (XLSX)

    S5 Dataset. Overlap between orthogroup-mapped genes and tissue lens and stress lens correlated genes with the GreenCut2 resource (Karpowicz).

    (XLSX)

    S6 Dataset. Metadata of the raw data used in this experiment.

    (CSV)

    S7 Dataset. Expression matrix of TPMs for the normalized orthogroups.

    (CSV)

    Attachment

    Submitted filename: Respone_to_Reviewers.docx

    Attachment

    Submitted filename: PLoS Biology reviewer responses 8-30-23.docx

    Data Availability Statement

    The code, metadata, and raw datasets from this project are available on a dedicated GitHub page: https://github.com/PlantsAndPython/plant-evo-mapper and Zenodo: https://zenodo.org/records/8428609


    Articles from PLOS Biology are provided here courtesy of PLOS
