Abstract
The Holozoa clade comprises animals and several unicellular lineages (choanoflagellates, filastereans, and teretosporeans). Understanding their full diversity is essential to address the origins of animals and other evolutionary questions. However, the unicellular lineages remain poorly known. To provide more insights into the real diversity of holozoans and check for undiscovered diversity, we here analyzed 18S rDNA metabarcoding data from the global Tara Oceans expedition. To overcome the low phylogenetic information contained in the metabarcoding data set (composed of sequences from the short V9 region of the gene), we used similarity networks by combining two data sets: unknown environmental sequences from Tara Oceans and known reference sequences from GenBank. We then calculated network metrics to compare environmental sequences with reference sequences. These metrics reflected the divergence between both types of sequences and provided an effective way to search for evolutionarily relevant diversity, further validated by phylogenetic placements. Our results showed that a large fraction of unicellular holozoan diversity remains hidden. We found novelties in several lineages, especially in Acanthoecida choanoflagellates. We also identified a potential new holozoan group that could not be assigned to any of the described extant clades. Data on geographical distribution showed that, although ubiquitous, each unicellular holozoan lineage exhibits a different distribution pattern. We also identified a positive association between new animal hosts and the ichthyosporean symbiont Creolimax fragrantissima, as well as for other holozoans previously reported as free-living. Overall, our analyses provide a fresh perspective into the diversity and ecology of unicellular holozoans, highlighting the amount of undescribed diversity.
Keywords: networks, metabarcoding, 18S, molecular diversity, unicellular Holozoa, novelty
Introduction
The origin of animals from their unicellular ancestor is, undoubtedly, an important evolutionary question. To address this question in the most effective way, we first need to have a well-resolved phylogenetic framework as well as a good understanding of the diversity of the closest unicellular relatives to animals (Ruiz-Trillo et al. 2007). Thanks to phylogenomic analyses, a well-resolved phylogenetic framework of animal origins is now in place. We know that animals are closely related to several unicellular lineages, namely Choanoflagellatea, Filasterea, and Teretosporea (Ichthyosporea and Corallochytrea), all together forming the Holozoa clade (Lang et al. 2002; Ruiz-Trillo et al. 2004, 2008; Shalchian-Tabrizi et al. 2008; Torruella et al. 2012, 2015; Grau-Bové et al. 2017). In contrast, environmental data show us that we still do not have a full understanding of the diversity of Holozoa (del Campo et al. 2015; Arroyo et al. 2018). Therefore, current interpretations on the evolutionary transition toward animal multicellularity may be challenged by improving our knowledge about Holozoa diversity (Ruiz-Trillo et al. 2007; del Campo et al. 2014).
To fill this gap and provide a more accurate perspective on Holozoa diversity and their geographical distribution, we analyzed the longest and largest metabarcoding marine data set: the Tara Oceans expedition, which is based on the 18S ribosomal RNA gene (hereafter 18S or 18S rDNA) (de Vargas et al. 2015; Pesant et al. 2015). The Tara Oceans data set comprises thousands of reads from hundreds of sampling stations around the globe, with a third of those reads not matching any reference in databases (de Vargas et al. 2015). However, a drawback of this data set is the absence of full-length 18S sequences: it is composed only of the relatively short V9 region (∼130 bp long), located at the end of the 18S (Hugerth et al. 2014).
To overcome the issue of the limited phylogenetic signal, we decided to analyze the Tara Oceans data set using gene similarity networks. Networks have been preferentially applied to study ecological interactions, such as predator–prey, parasite–host, or mutualism (Logares et al. 2014; Krabberød et al. 2017; Layeghifard et al. 2017; Pilosof et al. 2017; Valverde et al. 2018). Networks are now becoming widely adopted to explain complex evolutionary processes, such as horizontal gene transfer, gene domain fusion, and gene or genome introgression (Corel et al. 2016; Pathmanathan et al. 2018; Ocaña-Pallarès et al. 2019). To our knowledge, there are very few metabarcoding studies that used networks to describe novelty in metabarcoding data sets (Forster et al. 2015, 2019), even though this methodology offers a structure to test evolutionary questions in massive high-throughput data and to mine large data sets for sequences of interest.
Our analyses showed novel unicellular Holozoa diversity, in particular within Choanoflagellatea and Ichthyosporea. Specifically, we found unicellular Holozoa operational taxonomic units (OTUs) branching off several acanthoecid subgroups (e.g., Choanoflagellate H), Syssomonas multiformis and Creolimax fragrantissima. We also retrieved 15 Filasterea-related OTUs, detecting this clade for the very first time in an environmental survey. Interestingly, we also identified a putative novel unicellular Holozoa group, composed of 21 OTUs (6,244 reads in total), that could not be located within any other known lineage and may represent a novel lineage (here tentatively named MASHOL, for marine small Holozoa clade). We also observed that the freshwater environmental group FRESCHO3 could have diverged from a marine clade, showing another marine-to-freshwater transition in choanoflagellates. Finally, our co-occurrence analyses suggested potential novel associations between animals and ichthyosporeans. For example, the ichthyosporean C. fragrantissima could be associated with a broader range of animal hosts than previously described.
Results and Discussion
Initial Data Sets and Network Construction
To look for potential new diversity of unicellular Holozoa and to address their geographical distribution, we combined two 18S rRNA data sets: an environmental data set of OTUs and a reference data set with known holozoan sequences. The environmental data set came from the worldwide Tara Oceans expedition (de Vargas et al. 2015), which included metabarcoding data from the V9 region of the 18S rRNA gene from a total of 1,086 samples from 210 oceanic stations, 3 water column layers, and 10 size fractions (further details about sampling procedures can be found in Pesant et al. 2015). The reference data set was built by collecting sequences from both GenBank Nucleotide and PR2 databases (see Materials and Methods).
The initial unicellular Holozoa network was built from 2,426 sequences (2,197 from Tara Oceans, 229 from the reference data set). In the network, each node represented either an environmental OTU from Tara Oceans (hereafter ENV) or a sequence from the reference database (hereafter REF) (fig. 1). The basic structure of the network consisted of connected components (CCs): subgraphs of the network in which there is always a path between all nodes (fig. 2). The initial network was subsequently partitioned using increasing sequence similarity thresholds (≥85%, ≥87%, ≥90%, ≥95%, and ≥97%), resulting in more fragmented networks (fig. 2). In each of these networks, CCs could be classified into three types: CCs in which all nodes were environmental (CCENV), CCs in which all nodes were reference (CCREF), and CCs in which there were both types of nodes (CCMIX) (fig. 1).
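The thresholding and CC classification described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the node names, the edge list, and the decision to count isolated nodes as single-node CCs are assumptions made only for illustration.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component (iterative DFS)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(comp)
    return components

def classify_components(labels, weighted_edges, threshold):
    """Keep edges with identity >= threshold, then count CCENV, CCREF, and CCMIX."""
    kept = [(u, v) for u, v, identity in weighted_edges if identity >= threshold]
    counts = {"CCENV": 0, "CCREF": 0, "CCMIX": 0}
    for comp in connected_components(labels, kept):
        kinds = {labels[n] for n in comp}
        if kinds == {"ENV"}:
            counts["CCENV"] += 1
        elif kinds == {"REF"}:
            counts["CCREF"] += 1
        else:
            counts["CCMIX"] += 1
    return counts

# Hypothetical toy network: node -> type, and (node, node, % identity) edges.
labels = {"r1": "REF", "r2": "REF", "e1": "ENV", "e2": "ENV", "e3": "ENV"}
edges = [("r1", "e1", 98.0), ("e1", "e2", 91.0), ("r1", "r2", 86.0), ("e2", "e3", 96.0)]

for threshold in (85.0, 95.0, 97.0):
    print(threshold, classify_components(labels, edges, threshold))
```

In this toy example, raising the threshold splits the single mixed component into smaller ones, which mirrors the fragmentation described for the real networks.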
Networks produced at all thresholds displayed a similar trend: the number of CCENV was always the largest, followed by CCMIX and CCREF (supplementary fig. 1, Supplementary Material online), which indicated the presence of abundant divergent groups of environmental sequences, independently of the stringency level considered.
Definition of Novelty
To find potential novelty, we then explored the structure of the sequence similarity networks to search for molecular diversity. To do so, we calculated different metrics that are grouped into four categories:
Closeness centrality (fig. 2 and supplementary material 1, Supplementary Material online): It defines to which extent a node (sequence) is central in a network. Typically, a peripheral sequence in a CC is more divergent than the rest of the nodes in this CC because it has fewer direct neighbors, meaning that peripheral sequences share less similarity with the majority of the sequences with which they cluster. Therefore, we tested whether and which environmental sequences (ENV) were significantly more peripheral than reference sequences (REF) as a way to test whether ENV sequences extend the current known diversity of Holozoa, as well as to identify significantly peripheral ENV nodes.
Preferential association (assortativity, fig. 2, and supplementary material 1, Supplementary Material online): Assortativity quantifies whether nodes that belong to the same category (e.g., ENV or REF) are more connected with each other rather than with nodes from other categories. For example, a significant preferential association between ENV nodes in a network would indicate the existence of groups of similar environmental sequences, distinct from sequences from already described Holozoa.
Network comparison (path analyses by BRIDES) (fig. 2 and supplementary fig. 2, Supplementary Material online): It quantifies the new paths created in an augmented network when new sequences (e.g., ENV) are added to an original network (with only REF), as in Lord et al. (2016). In particular, this allows the evaluation of whether newly added ENV sequences fill in some gaps between the original REF sequences. Typically, Breakthroughs (B paths) and Shortcuts (S paths) indicate that added ENV sequences decrease the topological distance (hence by assumption the putative phylogenetic distance) between known REF sequences. By contrast, Impasses (I paths) indicate that added ENV sequences locate outside short paths between REF sequences in the augmented network.
Shortest path distance (fig. 2): Shortest paths describe the minimal number of edges needed to connect any pair of nodes in a network. We used these metrics to quantify a topological distance between ENV and REF nodes in the graph. By definition, increasingly divergent ENV sequences will be located increasingly far from REF sequences. If ENV and REF sequences are located in distinct CCs, there is no path between them at all; thus the shortest path distance for such pairs of nodes is infinite.
All these steps of graph-mining were used to detect ENV sequences that could potentially indicate novelty, for which phylogenetic placement could be finally computed.
The Structure of the Unicellular Holozoa Network Shows Potential Undiscovered Diversity
The general structure of the network provided an overview of the unicellular Holozoa diversity and highlighted potential new diversity (fig. 1). First, we computed the closeness of all nodes (figs. 2 and 3 and supplementary material 1, Supplementary Material online) to test whether the distribution of closeness values for REF nodes was: 1) significantly different and 2) significantly higher than the distribution of closeness values for ENV nodes, using the Wilcoxon signed-rank test. The results showed that ENV nodes were significantly more peripheral than REF nodes (Wilcoxon signed-rank test, P value < 0.01**) (fig. 3A) in all networks. This result indicates a high amount of potential new diversity in our unicellular Holozoa data set from Tara Oceans. Not only were the closeness distributions for REF nodes significantly higher than those for ENV nodes, but their shapes were also different. At ≥85%, ≥87%, and ≥90% identity thresholds, most closeness values of both ENV and REF distributions were low (95% confidence interval between 0.2 and 0.4, approximately), and only a few nodes presented a closeness value of 1. On the other hand, at ≥95% and ≥97% identity thresholds, when the network was more disconnected into divergent clusters of similar sequences, the distributions of closeness values for ENV nodes were scattered along a wider range of higher closeness values (∼0.2–1). This change reflected the fragmentation of the network into more but smaller CCs.
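As an illustration of the closeness values discussed above, the following sketch computes the closeness of a node within its own connected component. The normalization, (n − 1) divided by the sum of shortest path lengths to the n − 1 other nodes of the component, is an assumption (the exact definition used is given in the supplementary material); it is chosen because a node directly linked to every other node of its CC then scores 1. The toy graph is hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness(adj, node):
    """Closeness of a node within its connected component (0.0 for isolated nodes)."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    if total == 0:
        return 0.0
    return (len(dist) - 1) / total

# Hypothetical path graph a - b - c, plus an isolated node d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
print({n: round(closeness(adj, n), 3) for n in adj})
```

Here the central node b reaches closeness 1, whereas the peripheral nodes a and c score lower, which is the property exploited to flag peripheral ENV sequences.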
Next, we analyzed the assortativity, which showed significant preferential connections between ENV sequences. For every network, we computed 1) the distribution of null assortativity values by randomly shuffling the ENV and REF node labels, and we contrasted these values with 2) the assortativity values of all our real networks (see Materials and Methods). All networks were significantly assortative (one-sample t-test, P value < 0.01**) (fig. 3B). This tendency for intragroup preferential linkage suggests a lack of representation of oceanic Holozoa in the reference data set before the Tara Oceans expedition, stressing the high level of potential new diversity present in Tara Oceans data.
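The label-shuffling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient is the standard attribute assortativity computed from the mixing matrix of node categories, the toy network is hypothetical, and the comparison with the null distribution is done here with an empirical one-sided proportion rather than the one-sample t-test reported above.

```python
import random
from collections import Counter

def attribute_assortativity(edges, labels):
    """Assortativity coefficient for a categorical node label on an undirected graph."""
    mix = Counter()
    for u, v in edges:
        # Count each undirected edge in both directions so the matrix is symmetric.
        mix[(labels[u], labels[v])] += 1
        mix[(labels[v], labels[u])] += 1
    total = sum(mix.values())
    cats = set(labels.values())
    trace = sum(mix[(c, c)] for c in cats) / total
    marginals = {c: sum(mix[(c, d)] for d in cats) / total for c in cats}
    expected = sum(m * m for m in marginals.values())
    if expected == 1.0:
        return float("nan")  # undefined when all edge ends share one category
    return (trace - expected) / (1.0 - expected)

def shuffled_null(edges, labels, n_shuffles=1000, seed=0):
    """Null distribution obtained by permuting the labels over the nodes."""
    rng = random.Random(seed)
    nodes = list(labels)
    values = [labels[n] for n in nodes]
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(values)
        null.append(attribute_assortativity(edges, dict(zip(nodes, values))))
    return null

# Hypothetical toy network: an ENV triangle and a REF triangle joined by one edge.
labels = {"e1": "ENV", "e2": "ENV", "e3": "ENV", "r1": "REF", "r2": "REF", "r3": "REF"}
edges = [("e1", "e2"), ("e2", "e3"), ("e1", "e3"),
         ("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("e1", "r1")]

observed = attribute_assortativity(edges, labels)
null = shuffled_null(edges, labels)
p_empirical = sum(v >= observed for v in null) / len(null)
print(observed, p_empirical)
```

A positive observed value well above the shuffled values indicates that nodes of the same type are linked more often than expected by chance.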
Overall, these metrics (closeness and assortativity) indicated that our environmental data set of unicellular holozoans from Tara Oceans was different from the reference data set, expanding the current known diversity of this group.
New Molecular Diversity in Holozoa, Including a Potential Novel Clade
To identify new groups of interest, we first performed network comparisons using BRIDES software (fig. 2 and supplementary fig. 2, Supplementary Material online) (see Materials and Methods and Lord et al. 2016). This allowed us to contrast the topologies of networks built exclusively from REF nodes (original networks) with those in which ENV nodes had been included (augmented networks). BRIDES analysis showed that ENV sequences of unicellular Holozoa created numerous new paths in the augmented similarity networks (fig. 3C), guiding the discovery of evolutionarily relevant novel sequences. First, despite the enhanced molecular diversity provided by the Tara Oceans data set, some REF nodes remained disconnected from other REF nodes, indicating that most ENV sequences were not close enough to fill the gaps between REF sequences. This was especially noticeable for networks built at high similarity thresholds. At the ≥97% threshold, the vast majority of paths were impasses (I), meaning that ENV sequences did not create bridges between REF sequences in the augmented network (supplementary fig. 2, Supplementary Material online). This is logical because, given this high level of stringency, only sequences from the most closely related holozoan lineages would connect in a given CC, confirming the general divergent nature of most ENV sequences with respect to sequences from sequenced holozoan taxa. Interestingly, when lowering the similarity threshold required to connect sequences in the networks, the proportion of impasses decreased, showing that some of these divergent ENV sequences started to connect some REF sequences. Still, at ≥85% identity, some Holozoa REF sequences remained disconnected, suggesting that the Tara Oceans data set did not provide evidence for ENV groups bridging phylogenetic gaps between some known holozoan clades.
Possible explanations for this number of impasses may be: 1) a lack of sufficient sampling effort, 2) the absence of intermediate ENV sequences in marine water columns (they may exist in other habitats), 3) the nature of the Holozoa clade, which may comprise some significantly divergent lineages without extant intermediate diversity between them, or 4) that most ENV sequences belong to groups branching outside currently described holozoans.
On the other hand, breakthroughs (B) and shortcuts (S) were increasingly observed in networks at lower thresholds (fig. 3C). These two types of paths correspond to sequences that introduce either new paths between known Holozoan groups (B) or new ENV sequences closely related to known groups, and likely belonging to known clades (S). Thus, under the hypothesis that an intermediate position in the network reflected an intermediate phylogenetic position in the corresponding sequence phylogeny (Atkinson et al. 2009; Méheust et al. 2018), we assumed B paths could potentially indicate ENV sequences branching in between two phylogenetically distant groups of Holozoans in a phylogenetic tree, whereas S paths may potentially indicate ENV sequences branching within a less divergent group of sampled holozoans (supplementary fig. 2, Supplementary Material online). Overall, the presence of a high proportion of B and S paths (36.93% at ≥85%, 33.22% at ≥87%, 45.42% at ≥90% ID) suggested that Tara Oceans data hinted at the existence of novel, phylogenetically relevant, holozoan diversity.
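The logic behind these path categories can be illustrated with a simplified sketch. It is not the BRIDES implementation: it compares plain shortest paths between pairs of REF nodes in a REF-only network and in an augmented network, assumes that the augmented edge list contains every original edge, and therefore distinguishes only four outcomes (breakthrough, shortcut, equal, impasse). All node names are hypothetical.

```python
from collections import deque
from itertools import combinations

def build_adj(edges):
    """Adjacency sets for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path_length(adj, source, target):
    """Number of edges on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                if nb == target:
                    return dist[nb]
                queue.append(nb)
    return None

def classify_ref_pairs(ref_nodes, ref_edges, augmented_edges):
    """Count path categories over all REF pairs (simplified, see assumptions above)."""
    original = build_adj(ref_edges)
    augmented = build_adj(augmented_edges)
    counts = {"breakthrough": 0, "shortcut": 0, "equal": 0, "impasse": 0}
    for u, v in combinations(sorted(ref_nodes), 2):
        before = shortest_path_length(original, u, v)
        after = shortest_path_length(augmented, u, v)
        if before is None and after is None:
            counts["impasse"] += 1       # disconnected before and after
        elif before is None:
            counts["breakthrough"] += 1  # ENV nodes create a path that did not exist
        elif after is not None and after < before:
            counts["shortcut"] += 1      # ENV nodes shorten an existing path
        else:
            counts["equal"] += 1
    return counts

# Hypothetical toy networks: six REF nodes, two added ENV nodes (e1, e2).
ref_nodes = ["r1", "r2", "r3", "r4", "r5", "r6"]
ref_edges = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
augmented_edges = ref_edges + [("r1", "e1"), ("e1", "r4"), ("r4", "e2"), ("e2", "r5")]

print(classify_ref_pairs(ref_nodes, ref_edges, augmented_edges))
```

In the toy example, e1 shortens the path between r1 and r4 (a shortcut), e2 connects the previously isolated r5 (breakthroughs), and r6 stays unreachable (impasses).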
To corroborate the potential novelty of those sequences and have a better understanding of their phylogenetic position within Holozoa, we performed phylogenetic placement analyses (see Materials and Methods). In particular, we analyzed the OTUs that created breakthroughs and shortcuts in the network at the 85% similarity threshold (fig. 3D). These OTUs unraveled novelty within Acanthoecida, one of the two subgroups of Choanoflagellatea. A group of 6 sequences or OTUs (with a total of 1,675 reads) branched off Choanoflagellate H, suggesting a potential novel environmental group of acanthoecids. Another group of 3 sequences (including one of the most abundant OTUs in the whole Tara Oceans data set: OTU 2703, with >28,000 reads) appeared to be the sister group of Choanoflagellate G. The importance of this result lies in the fact that these OTUs did not cluster together with the already morphologically described Choanoflagellate G species (i.e., Acanthocorbis unguiculata, Acanthoeca spectabilis, Savillea micropora, Helgoeca nana), but branched at an internal node, showing their divergent nature. We also recovered the second earliest diverging acanthoecid (OTU 5953, with 7,448 reads), splitting apart from the reference sequence JQ223245, which had already been identified as a divergent choanoflagellate (del Campo et al. 2015). Finally, several OTUs clustered within freshwater environmental choanoflagellate groups, such as FRESCHO3 or FRESCHO1, which suggests that these groups can inhabit a wider range of ecosystems than previously reported. We confirmed the good quality of these phylogenetic placements by gauging the likelihood and distance between placements (supplementary fig. 3A and B, Supplementary Material online). Alignments and the full tree of figure 3D can be found in supplementary material 2, Supplementary Material online.
Our second approach to examine in detail the novelty in unicellular Holozoa was to perform a shortest path distance analysis between every ENV node and its closest REF node in the network (fig. 4). The longer the topological distance between REF and ENV nodes, the more divergent the ENV sequence is, because many steps are required to reach the nearest REF sequence. The most extreme case is the infinite distance, shown by ENV nodes belonging to exclusively environmental CCs. Our results showed that indirect connections to REF (when there is >1 step from ENV to REF) were the most abundant, ranging from 92.5% of all ENV nodes in the ≥85% identity network to 69.83% in the ≥97% identity network (fig. 4A). In addition, networks at higher similarity thresholds (≥95% identity and ≥97% identity) exhibited a high proportion of infinite distances (15.39% of ENV nodes at the ≥95% similarity threshold; 30.56% at the ≥97% similarity threshold) (fig. 4A). We then extracted those distant ENV OTUs to perform phylogenetic placement against a curated reference Holozoa tree (see Materials and Methods). The deepest novelty (understood as the diversity that lies in deeper, more internal nodes in the holozoan tree) was observed in the networks at ≥95% and ≥97% thresholds. We performed a specific phylogenetic placement of this deep novelty, shown in figure 4B. A group of 21 OTUs with a total abundance of 6,244 reads was located in the most internal branch outside Choanoflagellata, specifically scattered across the internal branches of choanoflagellates and Syssomonas multiformis. These OTUs were mainly recovered in the pico (0.8–3/5 μm) and nano (3/5–20 μm) size fractions from the Indian Ocean and Mediterranean Sea. Inspired by its uncertain phylogenetic position and the small size, we tentatively named this group MASHOL (standing for MArine Small HOLozoa). The quality of the placement test revealed that the placements had very low likelihood weight ratios (supplementary fig. 3D, Supplementary Material online), although all of them were located around the same internal branches in the tree. As Mahé et al. (2017) pointed out, these low-probability placements do not necessarily mean that they are incorrect, but rather that the sequences are at a high molecular distance from the reference sequences in the tree. This result indicates that these OTUs do not really belong to any of the already known unicellular holozoan lineages, although their exact position remains uncertain. In any case, they probably represent a novel clade among Holozoa.
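The distance from every ENV node to its closest REF node can be obtained in a single pass with a breadth-first search started simultaneously from all REF nodes. The sketch below is a minimal illustration on a hypothetical toy network; the grouping into direct (1 step), indirect (>1 step), and infinite distances follows the description above, while the function and variable names are assumptions.

```python
import math
from collections import deque

def distance_to_nearest_ref(adj, labels):
    """Shortest path length from each ENV node to its closest REF node."""
    # Multi-source BFS: every REF node starts at distance 0.
    dist = {n: 0 for n, kind in labels.items() if kind == "REF"}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    # ENV nodes never reached lie in exclusively environmental CCs.
    return {n: dist.get(n, math.inf) for n, kind in labels.items() if kind == "ENV"}

def summarize(distances):
    """Count direct, indirect, and infinite ENV-to-REF distances."""
    summary = {"direct": 0, "indirect": 0, "infinite": 0}
    for d in distances.values():
        if d == math.inf:
            summary["infinite"] += 1
        elif d == 1:
            summary["direct"] += 1
        else:
            summary["indirect"] += 1
    return summary

# Hypothetical toy network: r1 - e1 - e2, and a separate ENV-only pair e3 - e4.
labels = {"r1": "REF", "e1": "ENV", "e2": "ENV", "e3": "ENV", "e4": "ENV"}
adj = {"r1": {"e1"}, "e1": {"r1", "e2"}, "e2": {"e1"}, "e3": {"e4"}, "e4": {"e3"}}

distances = distance_to_nearest_ref(adj, labels)
print(distances, summarize(distances))
```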
Unicellular Holozoans Are Globally Distributed, with Some Lineages Showing Specific Geographical Patterns
There are no data on the geographical distribution of unicellular Holozoa. Thus, we decided to make the most of the Tara Oceans data set and evaluate the geographical distribution of the different unicellular holozoan lineages across oceans, layers of the water column, and size fractions. In general, all lineages of unicellular Holozoa were widely distributed across the world’s oceans (fig. 5A). Ichthyosporeans were the most homogeneously dispersed group across all oceans. There were, however, some exceptions. For example, Acanthoecida choanoflagellates were more abundant in the Arctic samples (60.29% of total abundance), in contrast to Craspedida (4.5%) (fig. 5A). These results are consistent with previous morphological studies of choanoflagellates in sea ice (Thomsen et al. 1997). OTUs assigned to Filasterea were widely distributed, but their abundance was higher in the samples coming from the South Pacific Ocean (43.37%), Red Sea (24.7%), and Indian Ocean (16.97%) (fig. 5A). OTUs related to the Corallochytrea group were widely distributed, although the OTU with the highest abundance (OTU 30781, 248 reads) was mainly located in the North Pacific Ocean (fig. 5A). Both the Indian Ocean and the Arctic Ocean held 30% of the reads of corallochytreans (fig. 5A). On the contrary, the presence of corallochytreans in the Atlantic Ocean seemed to be insignificant. The environmental groups Marine Opisthokonts 1 and 2 (MAOP1 and MAOP2, respectively) showed a pattern of distribution similar to Choanoflagellata. MAOP2 appeared to be more abundant and to comprise more OTUs than MAOP1, in contrast to what had been found in European coastal waters (del Campo et al. 2015). Moreover, although MAOP1 was not found in the Arctic or Antarctic Oceans, MAOP2 exhibited 36% of its abundance in the Arctic, greatly expanding the range of geographical locations in which this environmental group has been found up to now (fig. 5A) (Romari and Vaulot 2004; Amacher et al. 2009; Edgcomb et al. 2011; Marshall and Berbee 2011). Assortativity coefficients of geographical distribution across oceans and oceanic provinces showed positive values in all networks (supplementary table, Supplementary Material online). Even though these values were not very high (ranging from 0.016 in the network at the ≥85% identity threshold to 0.046 in the network at ≥97% identity), they show a tendency of OTUs from the same geographical region to be more associated with each other, hence genetically more similar, than with OTUs from other regions.
Regarding the depth in the water column, the majority of the unicellular holozoans were preferentially located in the surface or the deep chlorophyll maximum (DCM) layers (fig. 5B). This tendency indicates that holozoan sequences in the upper layers were more similar to each other than to those sampled at lower depths (positive assortativity, supplementary table, Supplementary Material online). Even though these are low positive numbers, they were significantly different from the randomly shuffled distribution (one-sample t-test, P value < 0.01**), which supported the tendency toward a preference for shallower locations.
Finally, unicellular holozoans were recovered from a wide range of size fractions (fig. 5C). For example, within Choanoflagellata, the majority of Acanthoecida abundance (69.37%) was present in the nano fraction (3/5–20 μm), followed by 19.4% in the pico fraction (0.8–3/5 μm). Filasterean reads were mainly found in meso (43.18%) and nano (46.21%) fractions. Ichthyosporeans had a different pattern of sizes (fig. 5C). The distribution of Dermocystida reads was shifted toward the largest fractions (10.96%, 19.98%, and 57.73% in meso, micro, and nano fractions, respectively). On the contrary, the distribution of Ichthyophonida reads was shifted toward the smallest fractions (24.46% in nano and 61.97% in pico fractions). OTUs associated with Corallochytrea were preferentially found in the pico, nano, and pico–nano fractions (0.8–20 μm). Finally, both MAOP groups were more present in the smallest fractions: nano (54.94%) and pico (37.81%), which differs from previous findings that showed MAOP dominating the micro fraction (del Campo and Ruiz-Trillo 2013). Nevertheless, these results are consistent with these authors, who already suggested that the MAOP group might be composed of species with different sizes. The MAOP group might also undergo a life cycle with several stages that include different cell sizes. The preferential location of different holozoan lineages in different size fractions can be seen in the assortativity values (supplementary table, Supplementary Material online). In all networks, assortativity coefficients of size fractions were the highest among all elements considered (depths, oceanic provinces, oceans, and size). These values were also significant compared with the distribution of randomly shuffled labels (one-sample t-test, P value < 0.01**), indicating a tendency for similar Holozoa sequences to be found in a specific size fraction rather than across sizes.
Co-Occurrence of Creolimax fragrantissima and Its Animal Hosts
Some of these unicellular species, especially the Ichthyosporea, have been previously described as animal parasites or symbionts (Mendoza et al. 2002; Glockling et al. 2013). To see whether our data could shed light on this aspect, we checked whether there was any association between the presence of unicellular Holozoa and animals.
Our results showed that there were indeed significant positive and negative correlations between unicellular Holozoa and animals (fig. 6A). The strongest positive correlation (Spearman’s rank correlation coefficient, ρS=0.6–0.8, P < 0.01**) was shown between OTUs associated with C. fragrantissima and several animal phyla such as Entoprocta (Barentsiidae), Mollusca (Polyplacophora), Tardigrada, and Porifera (Homoscleromorpha, Calcarea, and Demospongiae). To see if we could detect associations other than monotonic and linear ones (as Spearman and Pearson describe, respectively), we used a bipartite network (fig. 6B). We corroborated the previous finding of C. fragrantissima with several animal phyla, specifically with Polyplacophora (ρS=0.465), Calcarea (ρS=0.352), and Demospongiae (ρS=0.311). Creolimax fragrantissima was isolated 27 times from invertebrate guts, mostly from a sipunculid species, but also from one tunicate, sea cucumber, and chiton (Marshall et al. 2008). Thus, our results corroborated some symbiotic relationships (with Polyplacophora, commonly known as chitons) and suggested some other putative hosts (Entoprocta, Tardigrada, and Porifera).
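For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below is a minimal standard-library illustration; the abundance vectors are invented toy values across five hypothetical sampling stations, not Tara Oceans data, and no significance test is included.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented read counts of a protist OTU and an animal group at five stations.
protist_reads = [0, 5, 12, 3, 40]
animal_reads = [1, 4, 30, 2, 90]
print(spearman(protist_reads, animal_reads))
```

Because only ranks enter the calculation, any monotonic relationship (not just a linear one) yields a coefficient of 1 or −1.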
We also found that the environmental group Marine Ichthyosporea 1 (MAIP1) was associated with Acoelomorpha, Arthropoda (Hexapoda, Crustacea), Bryozoa, Cnidaria, Nematoda (Enoplea), and Chordata (Tunicata, Craniata). This result suggests that the environmental group MAIP1 may be associated with animal phyla rather than being exclusively free-living. Another interesting result was the interaction between MAOP2 and Ctenophora (ρS=0.409) or Mollusca (Cephalopoda) (ρS=0.317), which could imply that these taxa use the same resources or have some ecological interaction, as was found for other environmental groups (Lima-Mendez et al. 2015; Lambert et al. 2019).
Regarding MASHOL, the potential new Holozoa group described here, no strong correlations could be found with any animal group, suggesting that this environmental group might be free-living or might not have a strong association with any particular animal phylum.
Overall, these results suggest more complex ecological interactions between parasitic/symbiotic unicellular holozoans and animals than is currently known. These biotic effects (grazing, pathogenicity, and parasitism) have been reported to explain 82% of the variability in the Tara Oceans interactome, giving a greater importance to these interspecific connections (Lima-Mendez et al. 2015). This also implies that sampling within animal phyla may still be a useful method to isolate new species of unicellular holozoans. However, we stress that correlation does not imply causation. What is certain, though, is that metabarcoding has great power to assess diversity in its multiple forms, from pure ecological and evolutionary studies to applied conservation, which is of vital importance in a world where biodiversity is under threat.
Conclusions
Our analysis of metabarcoding data from Tara Oceans using sequence similarity networks shows a greater diversity of unicellular holozoans than previously sampled, including a potential novel clade. Our data also demonstrate a global geographical distribution of most unicellular holozoans and point to potential associations with different animal phyla.
Materials and Methods
Data Sets
The initial environmental data set was provided by the Tara Oceans consortium and contained a total of 474,303 OTUs from all eukaryotic clades. Note that this is the full data set generated in the expedition, not the one used in de Vargas et al. (2015), as the latter is a subsample of the former. The Tara Oceans consortium provided us with this data set already cleaned, filtered, and clustered. During the first steps of the bioinformatic pipeline, they merged, dereplicated, and quality filtered the original V9 barcodes. A chimera detection analysis was carried out using the usearch program (Edgar et al. 2011). After a filtering process to discard possible spurious reads, barcodes were clustered using the Swarm approach (Mahé et al. 2017). For further details on the OTU table generation, see http://taraoceans.sb-roscoff.fr/EukDiv/.
Our reference database was obtained by merging three different databases: GenBank, PR2-Opistho, and PR2_V9. First, we downloaded two databases from GenBank: nucleotide (nt) and environmental nucleotide (env_nt) on January 25, 2018. We retrieved 18S rDNA sequences from these databases by searching them using the human 18S sequence as a query (AC139250, positions 551,257–553,055). This sequence had been previously confirmed to contain the Tara Oceans V9 primer sequences. BlastN parameters were: E-value <1E-10, percentage of identity ≥60%, and maximum target sequences of 9.9×10^7 (for nt) and 9.9×10^8 (for env_nt). We then applied two filtering steps to the BlastN output. In the first one, we retrieved the sequences that contained both Tara Oceans V9 primer sequences. We then trimmed the sequences to keep only the V9 region. In the second step, we kept those sequences whose length was between 80 and 120 bp to retain the most frequent length range of this region (Amaral-Zettler et al. 2009). The second database, PR2-Opistho, was a well-curated and updated version of the original PR2 database for the Opisthokonta clade. This database (PR2-Opistho) was also trimmed with the Tara Oceans primer sequences to keep only the V9 region. The third database, PR2_V9, was generated by the Tara Oceans consortium (de Vargas et al. 2015). Because both PR2-Opistho and PR2_V9 were originally generated from the PR2 database, we eliminated redundancies and kept the taxonomical annotation from the PR2-Opistho database. Finally, we combined all databases, producing a global reference database of 49,379 eukaryotic sequences.
To retrieve the unicellular Holozoa sequences, we performed a phylogenetic placement of both environmental and reference data sets against a eukaryotic reference tree and took those that branched within Holozoa and outside animals. A phylogenetic placement consists of mapping short amplicons (in this case, Tara Oceans OTUs) onto a fixed reference tree made from full-length 18S rDNA sequences. This reference tree was constructed using 130 full-length 18S sequences that covered all eukaryotic groups. We performed the phylogenetic placement using the RAxML-EPA algorithm (Berger et al. 2011) and selected the sequences that were placed into unicellular Holozoa using the C++ script extract_clade_placements from Genesis software v0.18.1 (Czech and Stamatakis 2016). The resulting starting data set of unicellular Holozoa contained 2,426 sequences (2,197 environmental sequences from Tara Oceans and 229 reference sequences). This data set can be found in supplementary material 4, Supplementary Material online.
Similarity Network Construction
We built the initial similarity network based on a BLAST all-against-all of the unicellular Holozoa data set. We used BlastN v2.7.1+ (Camacho et al. 2009), with the following options: E-value <1E-10, percentage of identity ≥85%, maximum number of HSPs 1, and maximum target sequences 3,000.
We used the cleanblastp script from the CompositeSearch software (Pathmanathan et al. 2018) to filter the output, removing auto-loops and reciprocal connections (A–B is equivalent to B–A). Final networks were obtained by applying a mutual cover threshold of ≥95% and a series of increasing sequence identity thresholds: ≥85%, ≥87%, ≥90%, ≥95%, and ≥97%. These networks can be found in supplementary material 4, Supplementary Material online.
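The filtering logic described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the cleanblastp script itself: the tuple layout of each hit, the function name, and the toy identifiers are assumptions, and "mutual cover" is taken here to mean that the alignment covers at least 95% of both the query and the subject.

```python
def build_network(hits, min_identity, min_cover=95.0):
    """Return the undirected edges retained at a given identity threshold.

    Each hit is a tuple (query, subject, pct_identity, query_cover,
    subject_cover), with covers expressed as percentages of each
    sequence length (hypothetical layout chosen for this example).
    """
    edges = set()
    for query, subject, identity, qcov, scov in hits:
        if query == subject:
            continue  # drop self-hits (auto-loops)
        if identity < min_identity:
            continue  # below the sequence identity threshold
        if min(qcov, scov) < min_cover:
            continue  # mutual cover: both sequences must be covered
        # Sorting the pair collapses A-B and B-A into a single edge.
        edges.add(tuple(sorted((query, subject))))
    return edges


# Toy all-against-all hits for illustration only.
hits = [
    ("otu1", "otu1", 100.0, 100.0, 100.0),  # self-hit
    ("otu1", "ref1", 96.0, 98.0, 97.0),
    ("ref1", "otu1", 96.0, 97.0, 98.0),     # reciprocal of the previous hit
    ("otu2", "ref1", 88.0, 99.0, 96.0),
    ("otu3", "ref1", 99.0, 80.0, 99.0),     # fails the mutual cover filter
]

# One network per identity threshold used in the text.
networks = {t: build_network(hits, t) for t in (85, 87, 90, 95, 97)}
```

Because the thresholds are nested, every edge present at a stricter identity threshold is also present at all looser ones, so the networks become progressively sparser from 85% to 97%.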
Network Node Annotation
To taxonomically annotate every node in the network, we performed a BLAST of the initial 2,426 holozoan sequences against the PR2-Opistho database, using the following parameters: E-value <1E-50 and ≥97% percentage of identity. Under these conditions, only 438 sequences could be annotated. We therefore used a phylogenetic method, the tax2tree algorithm (McDonald et al. 2012), to taxonomically assign the remaining unannotated OTUs. This software takes a phylogenetic tree containing both reference and unannotated sequences, together with a file holding the taxonomical information of the annotated tips, and assigns a taxonomy to the unannotated tips. In this way, we successfully annotated 1,503 additional sequences. Thus, a total of 1,941 sequences (80.0% of the initial data set) could be taxonomically annotated.
Sequence Similarity Network Analyses
To address the molecular diversity and novelty of unicellular Holozoa, we analyzed topological metrics, as well as closeness and assortativity, using the NetworkX v2.1 library in Python 3.5.1 (Hagberg et al. 2008).
Novelty Assessment: Preferential Connection
Assortativity is a network property that measures the tendency of nodes to connect preferentially to nodes of the same group (Newman 2003; Forster et al. 2015) (fig. 2). To assess its significance, we first calculated a null distribution of assortativity values for each network by randomly reassigning the node labels under test (e.g., REF and ENV) while preserving the number of nodes carrying each label. A null distribution is needed because the assortativity expected at random may differ from 0, depending on the structure of the graph and the group sizes of the tested labels (fig. 2 and supplementary material 1, Supplementary Material online). In the standard protocol, we randomly shuffled the node labels 100 times while keeping the network topology fixed. For example, a node that was originally ENV (i.e., a node composed of an environmental sequence) could be labeled either ENV or REF (i.e., a node composed of a reference sequence) after the shuffling. We computed the assortativity of each of these 100 randomized networks, generating the distribution of assortativity values for random networks. We then computed the observed assortativity of the networks (fig. 3B and supplementary table, Supplementary Material online) for each tested pairwise comparison of categories and compared it with the null distribution to calculate the P values of our observations (ENV vs. REF; IND vs. MEDIT vs. ARCTIC vs. ANTAR vs. NPAC vs. SPAC vs. NATL vs. SATL vs. REDS; SURF vs. DCM vs. MES vs. MIX vs. ZZZ; MESO vs. MICRO_MESO vs. MICRO vs. NANO vs. PICO_NANO vs. PICO_MICRO vs. PICO).
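The label-shuffling procedure can be sketched as follows. This is a minimal, dependency-free sketch rather than the code used in this study, which relied on NetworkX (whose `attribute_assortativity_coefficient` function computes the same coefficient). The function names, the toy graph, and the one-sided empirical P value with a +1 correction are assumptions made for the example; the text does not specify how P values were derived from the null distribution.

```python
import random
from collections import Counter


def assortativity(edges, labels):
    """Newman's (2003) assortativity for a categorical node label
    on an undirected graph given as a list of (u, v) edges."""
    mixing = Counter()
    for u, v in edges:
        mixing[(labels[u], labels[v])] += 1
        mixing[(labels[v], labels[u])] += 1  # count both edge ends
    total = sum(mixing.values())
    # Fraction of edge ends joining nodes that share a label.
    trace = sum(c for (a, b), c in mixing.items() if a == b) / total
    # Fraction of edge ends attached to each label.
    ends = Counter()
    for (a, _), c in mixing.items():
        ends[a] += c / total
    expected = sum(x * x for x in ends.values())
    if expected == 1:
        return float("nan")  # undefined when only one label has edges
    return (trace - expected) / (1 - expected)


def null_distribution(edges, labels, n_shuffles=100, seed=0):
    """Shuffle labels over nodes, keeping topology and label counts."""
    rng = random.Random(seed)
    nodes = sorted(labels)
    values = [labels[n] for n in nodes]
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(values)
        null.append(assortativity(edges, dict(zip(nodes, values))))
    return null


# Toy network: an ENV triangle and a REF triangle joined by one edge.
edges = [("e1", "e2"), ("e2", "e3"), ("e1", "e3"),
         ("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("e3", "r1")]
labels = {"e1": "ENV", "e2": "ENV", "e3": "ENV",
          "r1": "REF", "r2": "REF", "r3": "REF"}

observed = assortativity(edges, labels)
null = null_distribution(edges, labels)
# Assumed one-sided empirical P value (not stated in the text).
p_value = (sum(v >= observed for v in null) + 1) / (len(null) + 1)
```

In the toy network, 12 of the 14 edge ends join nodes with the same label and each label holds half of the edge ends, so the observed coefficient is (12/14 − 0.5)/(1 − 0.5) = 5/7.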
Novelty Assessment: BRIDES
BRIDES software characterizes new paths that are created when extra nodes are added to an original network (Lord et al. 2016). For every sequence similarity network, we first used only the REF nodes (original network), and then we added the ENV nodes of unicellular Holozoa (augmented networks) to compute BRIDES using the default parameters.
Novelty Assessment: Phylogenetic Placement
To validate the putative novel diversity previously obtained with BRIDES and shortest path analyses, we performed a phylogenetic placement of the OTUs into our curated reference Holozoa tree, which can be found in supplementary material 5, Supplementary Material online. We aligned the sequences using PaPaRa with default parameters (Berger and Stamatakis 2011), then manually examined the alignment and corrected wrong positions in Geneious v9.0.5 (Kearse et al. 2012). For the alignment of selected sequences found on B and S paths by our BRIDES analysis, we trimmed the nonhomologous positions with trimAl 1.4.rev15, setting the gap threshold option at 0.2 (Capella-Gutiérrez et al. 2009). For the alignment of divergent sequences identified by our shortest path analyses, the trimming was done manually, removing those positions with a mean pairwise identity over all pairs <30%. We performed the phylogenetic placement using the RAxML-EPA algorithm (Berger et al. 2011). The final tree in figure 4B was enhanced using iTOL (Letunic and Bork 2007).
We validated the quality of the phylogenetic placement using the placement_histograms script from Genesis package v0.18.1 (Czech and Stamatakis 2016). The first parameter computed was the EDPL (Expected Distance between Placement Locations). For every OTU, it calculates the weighted distance between all placement positions. In other words, EDPL quantifies to which extent all placements from an OTU are scattered over the tree. In both groups, EDPL values were extremely small (<0.05) (supplementary fig. 2A and C, Supplementary Material online). Considering that most branches in the tree had <0.05 nucleotide substitutions per site, it meant that the majority of the OTUs were located within the same branch. However, the quality of these placements was not high, measured as the distribution and frequency of likelihood weight ratio values (LWR). This was especially drastic in the placements of MASHOL OTUs (supplementary fig. 2D, Supplementary Material online), which shows the uncertainty in the location of the group.
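To make the EDPL metric concrete, the sketch below computes it for a single OTU from its likelihood weight ratios and the tree distances between its candidate placement locations. It follows the definition commonly given for EDPL, namely the sum over pairs of placements of the product of their LWRs and the distance between them. It is not the Genesis implementation: the function name and toy values are hypothetical, and summing over ordered pairs (so that each unordered pair is counted twice) is an assumption.

```python
def edpl(lwr, dist):
    """Expected Distance between Placement Locations for one OTU.

    lwr  -- likelihood weight ratios of the candidate placements
    dist -- symmetric matrix of tree distances between placements,
            in substitutions per site
    """
    n = len(lwr)
    return sum(lwr[i] * lwr[j] * dist[i][j]
               for i in range(n) for j in range(n) if i != j)


# Toy OTU with three candidate placements (illustrative values only).
lwr = [0.7, 0.2, 0.1]
dist = [[0.00, 0.02, 0.10],
        [0.02, 0.00, 0.09],
        [0.10, 0.09, 0.00]]

value = edpl(lwr, dist)
```

An OTU with a single placement has an EDPL of 0, and the value grows when the likelihood weight is spread over locations that lie far apart in the tree. A small EDPL therefore indicates that the alternative placements are close to one another, even when no single placement has a high LWR.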
Geographical Distribution
We described the geographical distribution of unicellular Holozoa lineages, as well as their distribution along the water column and across size fractions, through circular layouts using the “circlize” package in RStudio (Gu et al. 2014; RStudio 2017).
Co-occurrence Patterns
To test the association between unicellular Holozoa and animal OTUs, we carried out a co-occurrence analysis. First, we filtered the data set to keep those OTUs that were present in at least 3 samples (out of 1,086 total samples in Tara Oceans). Then, we summed up OTU abundances if these OTUs belonged to the same class in animals or the same genus/species in unicellular Holozoa. We used the “corrplot” and “Hmisc” libraries in RStudio v.1.1.383 to perform the analyses (RStudio 2017; Wei et al. 2017; Harrell 2019). These analyses consisted of building a correlation matrix of all pairwise comparisons and then extracting the significant relationships (Spearman’s correlation, P < 0.01), which were finally plotted in a heatmap.
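The three steps of this analysis (prevalence filter, taxonomic aggregation, and rank correlation) can be illustrated with a short Python sketch. The analyses themselves were run in R with corrplot and Hmisc, so this is only an illustration of the logic: the function names, toy abundances, and group names are hypothetical, and the sketch returns Spearman's rho without the associated P value.

```python
def filter_by_prevalence(table, min_samples=3):
    """Keep OTUs with non-zero abundance in at least `min_samples` samples."""
    return {otu: counts for otu, counts in table.items()
            if sum(c > 0 for c in counts) >= min_samples}


def aggregate(table, group_of):
    """Sum per-sample abundances of OTUs sharing a taxonomic group."""
    grouped = {}
    for otu, counts in table.items():
        group = group_of[otu]
        if group in grouped:
            grouped[group] = [a + b for a, b in zip(grouped[group], counts)]
        else:
            grouped[group] = list(counts)
    return grouped


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy abundance table: OTU -> abundance in each of five samples.
table = {"otu_a": [5, 0, 3, 2, 0], "otu_b": [1, 0, 0, 0, 0],
         "otu_c": [0, 4, 1, 6, 2], "otu_d": [10, 1, 6, 4, 0]}
group_of = {"otu_a": "holozoan_X", "otu_b": "holozoan_X",
            "otu_c": "holozoan_X", "otu_d": "animal_class_Y"}

kept = filter_by_prevalence(table)
grouped = aggregate(kept, group_of)
rho = spearman(grouped["holozoan_X"], grouped["animal_class_Y"])
```

Here `otu_b` is discarded because it occurs in a single sample, and the two remaining holozoan OTUs are summed into one abundance profile before being correlated with the animal profile.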
It was possible, however, that some associations were neither monotonic nor linear, in which case they would not be detected by Spearman’s or Pearson’s correlation coefficients. We therefore also used the MICtools package (Albanese et al. 2018), which is able to identify a wider range of relationships in large data sets and assess their statistical significance. Final networks were created using Cytoscape 3.3.0 (Shannon et al. 2003).
Supplementary Material
Acknowledgments
We thank Ramón Massana, Philippe Lopez, and Ramiro Logares for discussion on the article. This work was supported by grants (BFU2014-57779-P and BFU2017-90114-P) from Ministerio de Economía y Competitividad (MINECO), Agencia Estatal de Investigación (AEI), and Fondo Europeo de Desarrollo Regional (FEDER) to I.R.-T.
Literature Cited
- Albanese D, Riccadonna S, Donati C, Franceschi P. 2018. A practical tool for maximal information coefficient analysis. Gigascience 7(4):1–8.
- Amacher J, Neuer S, Anderson I, Massana R. 2009. Molecular approach to determine contributions of the protist community to particle flux. Deep Sea Res. 56(12):2206–2215.
- Amaral-Zettler LA, McCliment EA, Ducklow HW, Huse SM. 2009. A method for studying protistan diversity using massively parallel sequencing of V9 hypervariable regions of small-subunit ribosomal RNA genes. PLoS One 4(7):e6372.
- Arroyo AS, López-Escardó D, Kim E, Ruiz-Trillo I, Najle SR. 2018. Novel diversity of deeply branching holomycota and unicellular holozoans revealed by metabarcoding in middle Paraná River, Argentina. Front Ecol Evol. 6:99.
- Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. 2009. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 4(2):e4345.
- Berger SA, Krompass D, Stamatakis A. 2011. Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 60(3):291–302.
- Berger SA, Stamatakis A. 2011. Aligning short reads to reference alignments and trees. Bioinformatics 27(15):2068–2075.
- Camacho C, et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10(1):421.
- Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972–1973.
- Corel E, Lopez P, Méheust R, Bapteste E. 2016. Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol. 24(3):224–237.
- Czech L, Stamatakis A. 2016. Genesis. a toolkit for working with phylogenetic data. Available from: https://github.com/lczech/genesis.
- del Campo J, et al. 2014. The others: our biased perspective of eukaryotic genomes. Trends Ecol Evol. 29(5):252–259.
- del Campo J, et al. 2015. Diversity and distribution of unicellular opisthokonts along the European coast analysed using high-throughput sequencing. Environ Microbiol.
- del Campo J, Ruiz-Trillo I. 2013. Environmental survey meta-analysis reveals hidden diversity among unicellular opisthokonts. Mol Biol Evol. 30(4):802–805.
- de Vargas C, et al. 2015. Eukaryotic plankton diversity in the sunlit ocean. Science 348(6237):1261605–1261612.
- Edgcomb V, et al. 2011. Protistan microbial observatory in the Cariaco Basin, Caribbean. I. Pyrosequencing vs Sanger insights into species richness. ISME J. 5(8):1344–1356.
- Forster D, et al. 2015. Testing ecological theories with sequence similarity networks: marine ciliates exhibit similar geographic dispersal patterns as multicellular organisms. BMC Biol. 13(1):16.
- Forster D, et al. 2019. Improving eDNA-based protist diversity assessments using networks of amplicon sequence variants. Environ Microbiol. 21(11):4109–4124.
- Glockling SL, Marshall WL, Gleason FH. 2013. Phylogenetic interpretations and ecological potentials of the Mesomycetozoea (Ichthyosporea). Fungal Ecol. 6(4):237–247.
- Grau-Bové X, et al. 2017. Dynamics of genomic innovation in the unicellular ancestry of animals. Elife 6:e26036.
- Gu Z, Gu L, Eils R, Schlesner M, Brors B. 2014. circlize implements and enhances circular visualization in R. Bioinformatics 30(19):2811–2812.
- Hagberg AA, Schult DA, Swart PJ. 2008. Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux G, Vaught T, Millman J, editors. Proceedings of the 7th Python in Science Conference (SciPy2008). Pasadena (CA). p. 11–15.
- Harrell FE. 2019. Hmisc: Harrell miscellaneous. Available from: https://github.com/harrelfe/Hmisc.
- Hugerth LW, et al. 2014. Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia. PLoS One 9(4):e95567.
- Kearse M, et al. 2012. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647–1649.
- Krabberød AK, Bjorbækmo MFM, Shalchian-Tabrizi K, Logares R. 2017. Exploring the oceanic microeukaryotic interactome with metaomics approaches. Aquat Microb Ecol. 79(1):1–12.
- Lambert S, et al. 2019. Rhythmicity of coastal marine picoeukaryotes, bacteria and archaea despite irregular environmental perturbations. ISME J. 13(2):388–401.
- Lang BF, O’Kelly C, Nerad T, Gray MW, Burger G. 2002. The closest unicellular relatives of animals. Curr Biol. 12(20):1773–1778.
- Layeghifard M, Hwang DM, Guttman DS. 2017. Disentangling interactions in the microbiome: a network perspective. Trends Microbiol. 25(3):217–228.
- Letunic I, Bork P. 2007. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1):127–128.
- Lima-Mendez G, et al. 2015. Determinants of community structure in the global plankton interactome. Science 348(6237):1262073–1262073.
- Logares R, et al. 2014. Patterns of rare and abundant marine microbial eukaryotes. Curr Biol. 24(8):813–821.
- Lord E, et al. 2016. BRIDES: a new fast algorithm and software for characterizing evolving similarity networks using breakthroughs, roadblocks, impasses, detours, equals and shortcuts. PLoS One 11(8):e0161474.
- Mahé F, et al. 2017. Parasites dominate hyperdiverse soil protist communities in Neotropical rainforests. Nat Ecol Evol. 1.
- Marshall WL, Berbee ML. 2011. Facing unknowns: living cultures (Pirum gemmata gen. nov., sp. nov., and Abeoforma whisleri, gen. nov., sp. nov.) from invertebrate digestive tracts represent an undescribed clade within the unicellular opisthokont lineage Ichthyosporea (Mesomycetozoea). Protist 162(1):33–57.
- Marshall WL, Celio G, McLaughlin DJ, Berbee ML. 2008. Multiple isolations of a culturable, motile Ichthyosporean (Mesomycetozoa, Opisthokonta), Creolimax fragrantissima n. gen., n. sp., from marine invertebrate digestive tracts. Protist 159(3):415–433.
- McDonald D, et al. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6(3):610–618.
- Méheust R, et al. 2018. Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution. Genome Biol. 19(1):75.
- Mendoza L, Taylor JW, Ajello L. 2002. The class Mesomycetozoea: a heterogeneous group of microorganisms at the animal-fungal boundary. Annu Rev Microbiol. 56(1):315–344.
- Newman MEJ. 2003. Mixing patterns in networks. Phys Rev E. 67(2):1–13.
- Ocaña-Pallarès E, Najle SR, Scazzocchio C, Ruiz-Trillo I. 2019. Reticulate evolution in eukaryotes: origin and evolution of the nitrate assimilation pathway. PLoS Genet. 15(2):e1007986.
- Pathmanathan JS, Lopez P, Lapointe F-J, Bapteste E. 2018. CompositeSearch: a generalized network approach for composite gene families detection. Mol Biol Evol. 35(1):252–255.
- Pesant S, et al. 2015. Open science resources for the discovery and analysis of Tara Oceans data. Sci Data. 2:150023.
- Pilosof S, Porter MA, Pascual M, Kéfi S. 2017. The multilayer nature of ecological networks. Nat Ecol Evol. 1:0101.
- Romari K, Vaulot D. 2004. Composition and temporal variability of picoeukaryote communities at a coastal site of the English Channel from 18S rDNA sequences. Limnol Oceanogr. 49(3):784–798.
- RStudio T. 2017. Rstudio: integrated development for R. Available from: http://www.rstudio.com.
- Ruiz-Trillo I, et al. 2004. Capsaspora owczarzaki is an independent opisthokont lineage. Curr Biol. 14(22):R946–R947.
- Ruiz-Trillo I, et al. 2007. The origins of multicellularity: a multi-taxon genome initiative. Trends Genet. 23(3):113–118.
- Ruiz-Trillo I, Roger AJ, Burger G, Gray MW, Lang BF. 2008. A phylogenomic investigation into the origin of Metazoa. Mol Biol Evol. 25(4):664–672.
- Shalchian-Tabrizi K, et al. 2008. Multigene phylogeny of Choanozoa and the origin of animals. PLoS One 3(5):e2098.
- Shannon P, et al. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13(11):2498–2504.
- Thomsen HA, Garrison DL, Kosman C. 1997. Choanoflagellates (Acanthoecidae, Choanoflagellida) from the Weddell sea, Antarctica, taxonomy and community structure with particular emphasis on the ice biota; with preliminary remarks on Choanoflagellates from Arctic sea ice (Northeast Water Polynya). G Arch Protistenkd. 148(1–2):77–114.
- Torruella G, et al. 2012. Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single-copy protein domains. Mol Biol Evol. 29(2):531–544.
- Torruella G, et al. 2015. Phylogenomics reveals convergent evolution of lifestyles in close relatives of animals and fungi. Curr Biol. 25:1–7.
- Valverde S, et al. 2018. The architecture of mutualistic networks as an evolutionary spandrel. Nat Ecol Evol. 2(1):94–99.
- Wei T, et al. 2017. corrplot: visualization of a correlation matrix. Available from: https://github.com/taiyun/corrplot.