Summary
Today’s genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a new unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), an approach that directly analyzes raw sequencing data to detect a signature of regulation: sample-specific sequence variation. The approach, which includes a new statistical test, is computationally efficient and can be run at scale. SPLASH unifies detection of myriad forms of sequence variation. We demonstrate that SPLASH identifies complex mutation patterns in SARS-CoV-2 strains, discovers regulated RNA isoforms at the single cell level, documents the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a new unifying approach to genomic analysis that enables an expansive scope of discovery without metadata or references.
One-sentence summary:
SPLASH is a unifying, statistically driven approach to biological discovery from raw sequencing data, bypassing alignment.
Introduction
Genomics is now foundational to biology, ecology and medicine, and as sequencing databases grow, so too does the opportunity to leverage them for discovery. How can this data best be analyzed to reveal regulation and function?
Traditionally, bioinformatic pipelines start by assigning genomic positions to reads via alignment to a reference genome. While alignment has served reasonably well for decades, it has many limitations. For less studied organisms, references are often incomplete and or partially misassembled; and for many organisms, there is no reference. Reliance on references risks missing sequences that are not present in the reference, a major problem even in the intensely studied human genome: for example, genetic variants associated with under-studied populations are poorly represented in current references1, undermining the impact of genomic studies and exasperating health disparities. Some diseases such as cancer are almost defined by their changes from the reference, and vary even within a single tumor, so reference-based approaches are inaccurate approximations of the underlying biology. Finally, the enormous diversity of viral and microbial genomes, and their constant adaptation2,3 makes it infeasible to define a complete set of references. Practically, alignment to references is computationally slow, limiting the scale of genomic inference.
Another challenge for reference-based methods is dealing with repetitive elements: aligners are forced to assign mapping positions to reads that simply do not have a unique alignment. Repetitive regions comprise ~54% of the human genome4 and may be important to analyze, but are often just ignored because alignment is difficult or impossible. Further, many genes have sequence-similar paralogs, which pose similar alignment issues.
Testing statistical hypotheses after alignment is also problematic. For reads that do map, the complex algorithms underlying sequence alignment, which may have ad hoc and hidden assumptions, are difficult to model; thus statistical statements about differences (from SNPs to splicing) are difficult if not impossible to obtain in closed form. Permutation-based methods do not solve this problem: they are capable of giving 10-fold underestimates of the false discovery rate (FDR)5,6. Even simple tasks such as calling allele-specific expression in well-studied genomes are fraught with statistical imprecision introduced during the alignment step7.
As an alternative paradigm, we considered if there could be a shared formulation that would allow signals of biological interest to be detected with statistical methods, directly from sequencing data (without a reference genome), unifying seemingly disparate sets of biological questions (currently addressed with custom and distinct bioinformatic approaches). For example: does a set of patients harbor distinct viral strains? Are alternative mRNA transcripts specific to distinct cell types or conditions? Does RNA expression and processing differ with geography or season? Notice that the answer to each involves studying how sequences vary within a set of samples. As we show, this idea can be made formal, and answers can be obtained with an approach that uses references as an interpretive guide rather than a critical step in statistical inference.
Here, we introduce a simple unifying paradigm for genomic analysis: SPLASH (Statistically Primary aLignment Agnostic Sequence Homing). SPLASH is applied directly to sequencing reads without the use of a reference genome. It relies on a simple formalization of sequence variation (short stretches of varying sequences, “targets”, adjacent to short stretches of a constant sequence, “anchors”) that captures many biologically important events. The distribution of anchor-target pairs are cataloged and the differences in their distribution between the samples is tested statistically at a very early stage in the pipeline. This simultaneously detects variation of myriad forms: single-nucleotide polymorphisms, alternative exon splicing, DNA rearrangement or transposition, and more. With today’s paradigm, a separate algorithm would be required to detect each type of variation. Instead, with SPLASH, if the user is interested in specific types of variation they can focus on that with post-hoc processing and analysis; alternatively, SPLASH may – and in many cases does – expand horizons to unexpected types of variation. In the results below, we demonstrate a snapshot of SPLASH’s wide possibilities for discovery, first in human samples: encompassing viral strain variation, alternative isoforms in single cells, B and T cell antigen receptor diversity. We also show that SPLASH is easily applied to less studied organisms: in lemur, to find antigen receptor diversity; in octopus, to detect tissue-specific splicing, processing and expression; and in eelgrass, to uncover allelic variation, splicing, and diatom association. We believe this demonstrates SPLASH’s potential to discover meaningful sequence variation, with or without the aid of reference genomes, across many biological questions and organisms.
Results
SPLASH is a k-mer based, statistics-first approach to identify sample-dependent sequence variation
The fundamental concept of SPLASH is anchors and targets (Figure 1A). An “anchor” is simply any particular k-mer of sequence in a read (k = 27 by default, but adjustable). Every k-mer a fixed offset downstream (R, which may be zero) is called a “target”. Note that targets are always defined relative to an anchor, and a given anchor may have multiple associated targets. Anchors with more than one target can report on nearly all sequence variation of interest: the simplest case is where two targets differ at a single position, as in mutations, allelic differences or RNA editing. It also can capture insertions and deletions, alternative splicing and isoform usage, mobility of transposons, gene rearrangements and fusions, antibody/T cell receptor somatic variation, and more.
The fundamental goal of SPLASH is to detect variation among a set of sequencing samples, and it requires only a FASTQ file for each sample. The definition of a “sample” is dictated by the biological question: each sample could be from different cells, different tissues, different individuals, or different mixed populations (e.g. metagenomics). Samples might differ by conditions as well – different times or treatments, and also in other features (cell-type, geographic location, etc.); we refer to these generically as “metadata”.
The principle of SPLASH is depicted in Figure 1B (and discussed in detail in Note S2 and Figure S1). Conceptually, SPLASH steps through all positions in all reads of all samples, recording all anchor-target pairs. To save compute time, SPLASH can run in a mode that only analyzes anchors at a subset of read positions, e.g., every fifth position (as used in this work). SPLASH compiles a separate counts table for each anchor, with a column for each sample and row for each target; each entry is the counts of a given target in a given sample; classically, this is a contingency table. Compiling these tables requires only a single pass through the raw sequences and doesn’t involve alignment to a reference, so it is computationally efficient. Importantly, we have developed a highly flexible test statistic that captures the desire to find relatively discrete groups of samples with differing variation, and that controls false positives even for low numbers of observations. This statistic admits a rigorous closed form p-value bound, which is thus fast to compute (unlike resampling methods). For each anchor, SPLASH calculates a p-value bound for the null hypothesis that the observed target frequencies in the samples all come from the same distribution, i.e., that there is no underlying variation of targets between the samples. A low p-value for an anchor implies that the samples differ significantly in which targets they contain.
We emphasize that SPLASH does not need predefined sample groups (though it can also operate in such a “supervised” mode). Indeed, for all the results presented here, we used SPLASH in its default unsupervised mode. For each anchor, SPLASH tries many random splits of the samples and the targets, retains the one which minimizes the p-value, and reports the corresponding effect size for this grouping (next paragraph). Although perhaps surprising, this random process is designed to and does detect the presence of strongly patterned target variation among the samples, if it exists (Figure S3, Note S3).
SPLASH also calculates for each anchor an “effect size” that ranges from 0 to 1, with 0 meaning that the target distribution is the same between the two groups, and 1 meaning that the targets found between the two groups are disjoint. Effect size does not require that target distributions of all samples within a group are similar, just that they are different from the other group; thus, effect size can be high even when there are more than two natural groups. Anchors with effect sizes closer to 1 have target variation that is more discrete across the samples, and thus are more amenable to biological investigation.
To aid in biological interpretation it can be useful to have longer sequence context than just the anchor and target. Thus, SPLASH also can generate a “consensus” for each anchor, in each sample. Consensuses are longer sequences (up to the length of a read) assembled from the raw reads of a given sample, looking at every occurrence of an anchor and extending its sequence base by base as long as the reads show a consensus (see Methods). Mapping consensuses to protein sequences can identify protein domains – protein domain profiling – a powerful, reference-free attribution of biological function. Consensuses can also be aligned to sequence databases or reference genomes; aligning only the consensus sequences for significant anchors reduces the typical computational load by 1000-fold compared to usual approaches that align all reads.
Figure 1C diagrams some of the differences between traditional alignment-first approaches and SPLASH; Figure 1D outlines interesting use cases for SPLASH. These guided us in our initial explorations with SPLASH, which are described below. The datasets analyzed were simply the first ones we chose, and may not be the best or most informative. While SPLASH has some adjustable parameters, we did not attempt to “tune” these (indeed, SPLASH seems robust to a range of parameters, Note S3); all analyses were run with the same settings, in unsupervised mode (blind to metadata). Despite this, we found that SPLASH performed well across a variety of datasets, and in all cases found significant patterns of sequence variation (q-values for anchors, and binomial p-values quantitating the visually evident spreads in plots, are given in the Note S7).
SPLASH identifies strain-defining and other mutations in SARS-CoV-2 de novo
Viral genomes have high mutation rates, including quasispecies2. For example, the emergence of SARS-CoV-2 was followed by multiple surges caused by variant strains, over the course of just two years. This is an ideal setting for the application of SPLASH: viruses are always mutating, but out of a sea of mutations, scientists, clinicians and public health officials want to identify those showing consistent variation between patient groups.
We applied SPLASH in unsupervised mode to two SARS-CoV-2 datasets, both viral amplicon Illumina sequencing of nasopharyngeal swabs from infection-positive patients, taken from times when the dominant viral strains were Delta or Omicron. The samples from South Africa8 (Nov to Dec-2021, 70 samples) represent the rapid rise of Omicron during its first outbreak. The samples from France9 (Dec-2021 to Feb-2022, 106 samples) represent cases of co-infection by more than one strain; the authors of the original study provided as metadata the assignment of the primary and secondary viral strains for each case (though not used by SPLASH).
SPLASH finds many significant anchors with low corrected p-values (<0.05) and high effect-sizes (>0.5), directly from sequencing reads (250 for South Africa dataset, 252 for France). High effect sizes are expected for anchors that partition samples by strain. To test if SPLASH recovers strain-defining and other variation, we examined the subset of significant anchors that perfectly map to a reference strain (Original, Delta, Omicron BA.1 or BA.2; defining mutations taken from CoVariants.org10), and call an anchor “strain-defining” if it has at least two targets with >5% abundance, at least one of which perfectly matches to a reference strain; by definition the other target is distinct. We compare to a control set of anchors, those that are most abundant across all the reads; this tests that SPLASH’s findings are not solely due to sequence abundance. In the South Africa dataset, 98% (126/128) of SPLASH anchors that mapped perfectly were strain-defining, compared to 7/201(3.5%) in the control set (hypergeometric p-value <1.7E-79). In the France dataset, 100% (39/39) of SPLASH anchors were strain-defining, compared to 8.4% (21/250) of the control anchors (hypergeometric p-value <2.6E-33). Nearly all the control anchors have only a single abundant target (with the remaining low abundance targets representing either private mutations or sequencing errors). Thus SPLASH, though blind to strain reference sequences and sample metadata, detects strain differences with high precision.
Figure 2A–C shows exemplary strain-defining mutations identified by SPLASH; we focus on the Spike protein (S gene), where the majority of the mutations were located. Figure 2A shows an anchor that distinguishes at the major lineage level: one target has no mutations (relative to the Original SARS-CoV-2 reference) and is consistent with Delta; the other target has the mutation K417N, found in all Omicron strains (but not in Delta or Original). Target abundances across samples are consistent with the strain assignment metadata. Figure 2B shows an anchor that discriminates sub-lineages: one target has no mutation, consistent with Delta; the second target has the 3-nt deletion NL211I and the 9-nt insertion R214REPE which are Omicron BA.1 specific; the third target has the mutation V213G which is Omicron BA.2 specific. Figure 2C shows an anchor that detects mutations not in our references. One target has a pair of mutations, N679K and P681H, found in all Omicron strains. The other targets all have S:P681R, a Delta-specific mutation, but two targets additionally encode Q677H (by different mutations). Mutations at Q677 may provide a selective advantage: they have arisen independently multiple times on different lineages11,12, and Q677H in several strain backgrounds enhanced infectivity, syncytia formation, and resistance to neutralizing antibodies in pseudotype assay13.
SPLASH results can also be analyzed completely without a reference genome by examining the coding potential of its nominated sequences. The anchor sequence is extended by a consensus approach, translated in silico to amino acid sequences (in all six reading frames) and matched against a database of protein domain models such as Pfam14; the best Pfam hit is assigned to that anchor. Protein profiles that are more frequently associated with significant SPLASH anchors, compared to control anchors, are candidates for proteins with important patterns of variation. SPLASH protein profiles for SARS-CoV-2 were overall extremely different from controls (chi-squared test p-values: France <3.7E-7, South Africa <2.5E-39) (Figure 2D). The top four SPLASH protein domains in both datasets were beta-coronavirus receptor-binding domain (RBD; within S1 region of spike protein), coronavirus S2 domain (within spike protein), coronavirus M (matrix/membrane) protein, and coronavirus ORF7a protein. By contrast, the top four control domains for each dataset were completely different from these, and also different between datasets. The SPLASH protein domains are significantly enriched when considered individually as well; for example, in the South Africa dataset for spike S2 domain: 23 SPLASH vs 3 control hits, p = 2.9E-6 (corrected hypergeometric p-value). The spike protein is the major site of antigenic variation in coronaviruses, as it is a principal focus of the immune response; the RBD is well known as a target for natural and therapeutic neutralizing antibodies15, but in addition about 50% of natural anti-spike antibodies are directed against the S2 domain16.
We also carried out SPLASH protein domain profiling on an unrelated virus, rotavirus17 (102 samples). The domains enriched over controls were rotavirus VP3 and NSP3 proteins (Figure S2A). These two proteins have roles in blocking host innate immunity18. Thus, variation in viral protein domains interacting with the immune system may be a recurring theme in SPLASH protein profiling of viral strains.
In summary, SPLASH finds patterns of variation in SARS-CoV-2, including those characteristic for strains, without requiring reference sequences or metadata (though these obviously help with interpretation); the methodology should be generally applicable to other viruses. More broadly, since SPLASH analyzes variation without any predetermined reference or template, it may be useful in surveillance for new strains or even completely new pathogens, and to cluster patients directly from raw sequencing samples.
SPLASH identifies regulated expression of paralogs and HLA in single cell RNA-seq
Today’s approaches to single cell bioinformatics require references and custom bioinformatic pipelines that struggle to quantify the extent of (post-)transcriptional regulation of gene expression which includes splicing and highly similar sequence variants, such as paralogs. We tested if SPLASH can unify discovery in the realm of single cell sequencing generated with the Smart-Seq2 (SS2) protocol19, which provides broad transcript coverage.
Our first testbed was human macrophage and capillary cells from the Human Lung Cell Atlas20. We previously established that these cell types have regulated alternative splicing in MYL6, a light chain subunit of myosin motor protein, so that serves as a positive control21. As expected, among SPLASH’s most significant anchors are ones reporting this MYL6 alternative splicing, splicing in or out of a cassette exon (Figure S4A). Surprisingly, other SPLASH top anchors also involved different myosin light chain subunits, MYL12A and MYL12B. These two genes are paralogs with highly similar coding regions (95% nucleotide, 98% amino acid identity for human). Nevertheless, SPLASH finds targets that specifically distinguish them. They show that macrophages express more MYL12A, while capillary cells express more MYL12B (Figure 3A). This same pattern is observed in two different individuals. Not much is known about these two genes, but they are known to show differential expression in rat tissues22, and there is evolutionary conservation in mammals, birds, and reptiles of two adjacent MYL12 paralogs within a syntenic region (e.g. human; rat22; Gallus gallus, NCBI Gene IDs 396284 and 770011; Chelonia mydas, Gene IDs 102938771 and 102937279). Besides the small number of amino acid differences between the paralogs, there may also be an important functional role for nucleotide sequence differences, as has been demonstrated for another pair of highly similar cytoskeletal paralogs, beta- and gamma-actin23.
In the same data, SPLASH finds cell type-specific expression of genes in the major histocompatibility complex (MHC), known as HLA in humans. HLA is the most polymorphic region in the human genome, with the most known disease associations; the polymorphism of HLA class I and class II proteins is intimately tied to their function in antigen presentation for adaptive immunity24. Due to the high levels of polymorphism, the region is challenging for representation in a reference genome and for read-mapping pipelines. SPLASH identifies an anchor and targets that distinguish between the highly similar 3’ untranslated regions (UTRs) of HLA-DRB1 and HLA-DRB4 (class II beta-chains) (Figure 3B). Macrophages express mainly DRB1, while capillary cells mainly express DRB4. The same pattern is found in two different individuals, who carry different alleles at DRB1 and DRB4 (as seen in the alignment). Macrophages are “professional” antigen-presenting cells and constitutively express HLA class II. Not all endothelial cells express class II in vivo, however most human capillary cells do25; endothelial MHC expression may be strongly cytokine-dependent26. Thus, class II expression in macrophages and capillary cells is likely to be regulated differently. Five major haplotypes have been identified at the HLA-DRB locus: all contain a DRB1 gene, but some contain an additional functional paralog, either DRB3, DRB4, or DRB527. DRB1 is extremely polymorphic (3516 alleles, in March 2023); the paralogs somewhat less so, e.g. DRB4 (236 alleles)28.
In one individual (P3), SPLASH also finds a remarkable anchor and two targets that report on a pair of class II genes, HLA-DPA1 and HLA-DPB1 (class II alpha and beta chains, respectively), which are unique among HLA genes in being organized head-to-head and transcribed in opposite directions. The anchor lies in sequence shared by DPB1 and a specific isoform of DPA1, while the targets lie in exons exclusive to each; the extended sequences provided by SPLASH consensuses confirm directionality as they bridge splice junctions. Macrophages in this individual express exclusively DPB1, while capillary cells express mainly DPA1 (Figure 3C). This pattern may be haplotype-specific, as we did not find a similar anchor for another individual (P2).
A final example is SPLASH’s detection of allele-specific expression of HLA-B in human T cells (from a different dataset, see next section). The class I gene HLA-B is the most polymorphic of all HLA genes (9274 alleles)28, and HLA-B is the gene with the most anchors found by SPLASH in human T cells (Figure 3D). Since these T cells all come from one individual, this indicates substantial variation in expression of this individual’s two HLA-B alleles at the single-cell level, as can be seen in the scatterplot (Figure 3D). Different T cells express a wide range of ratios of the two alleles, some cells expressing both alleles, but others expressing almost entirely one allele or the other (well outside the 98% confidence interval of what is expected by the average ratio). This is in keeping with a preprint that found allele-specific expression of HLA class I genes in normal breast epithelial cells, as well as in multiple myeloma cells29; other single-cell studies have found changes in allele-specific expression during T cell activation30, and identified HLA expression quantitative trait loci31.
Overall, SPLASH finds disparate biological features producing regulated variation at the single cell level: paralogs, splicing, and alleles, especially within the medically important and polymorphic HLA locus. This is just a small glimpse into the complexity of HLA haplotype- and cell type-specific expression patterns; it raises the possibility that disease-related HLA alleles might be expressed differently in key cell types compared to other alleles.
SPLASH identifies B and T cell receptor diversity in human and lemur single cell RNA-seq
Adaptive immune receptors for B cells (immunoglobulin or Ig), and T cells (T cell receptor or TCR) are generated combinatorially through V(D)J recombination, and Ig is further diversified through somatic hypermutation. Rearranged sequences are absent from germline reference genomes and cannot be cataloged comprehensively due to their huge potential diversity, empirically estimated at 1010–1011 for Ig heavy chains32. These genomic loci currently require manual curation due to their complexity and repetitive structure, so only a few species have high quality annotations. Existing methods to assign V(D)J rearrangements in single cells33 depend on annotations and so may have blindspots. Since SPLASH is designed to identify sequence diversity without need for references, we predicted that it would identify adaptive immune receptors de novo.
We ran SPLASH on 50 naive human B cells from peripheral blood of one individual, and separately on 128 CD4+ human T cells of another individual, taken from Tabula Sapiens, a large multi-organ dataset34. As a first pass, requiring no reference or alignment at all, protein profiling found that the most frequent domains matching to SPLASH anchors in B cells were Ig V-set and C1-set (variable-like and constant-like domains); these two domains were also matches in T cells (attributable to TCR) (Figure 4A). Mapping transcript gene-names to SPLASH anchors gives a similar picture: Ig light chain genes (both kappa and lambda) were strongly predominant for B cells; HLA-B (discussed above) and TCR genes (both alpha and beta) were most prominent for T cells; and these are not found among the control anchors which are up to 10-fold more abundant (Figure 4A). Significant anchors were also found in Ig heavy chains (shown in lemur, below), though fewer anchors than in Ig light chains. Ig/TCR anchors characteristically have a high diversity of targets (“entropy”, a measure reported by SPLASH), and could be identified on that basis rather than requiring reference mapping. This is expected for clonally diverse receptors, and is very evident in the clonotypic pattern (each cell expressing only its specific target) seen in the heatmaps (Figure 4B).
To showcase application to non-model organisms, we also ran SPLASH on mouse lemur (Microcebus murinus). Mouse lemurs are primates that diverged from humans 60–75 million years ago, and have potential as a model organism35. The lemur reference genome is incompletely annotated, especially at loci such as Ig and TCR. While the human reference does not suffice to perform alignment-first analysis of mouse lemur transcriptomes, we find here that it is a reasonable approximation for interpreting SPLASH outputs; this may generalize, allowing SPLASH interpretation for organisms wherever a related, better-curated reference exists. From Tabula Microcebus, a multi-organ mouse lemur dataset36, we analyzed 111 natural killer T (NKT) cells from spleen and, separately, 289 B cells also from spleen. In both analyses, the cells came from two different individuals; for this reason, SPLASH also discovered numerous allelic differences between individuals, such as in COX2 (Figure S5A).
Our main focus was on adaptive immune receptors; similar to the human analyses, we found that SPLASH’s lemur anchors in B and NKT cells included Ig C1-set and V-set domains by protein profiling and Ig/TCR gene-names by transcript mapping (data not shown). As expected, SPLASH targets for lemur Ig heavy chain are predominantly clonotypically expressed (Figure 4B). Lemur NKT cells provide an interesting counterpoint. While there is some clonotypic diversity, it is striking that a number of cells share TCR-alpha sequences; notably, the shared target is different between the two individuals (bottom-right heatmap, top row vs. second row). We had selected NKT cells for analysis without foreknowledge of their properties. However, SPLASH has correctly characterized diversity in this case: it is known that an NKT subset expresses an “invariant” TCR-alpha chain with limited TCR-beta diversity, in humans and mice; NKT cells bridge between adaptive and innate immunity37. For Tabula Microcebus, NKT cells were operationally defined as co-expressing CD3E and KLRB1 (CD161)36; in this cell-type, SPLASH also finds shared usage in TCR-beta and TCR-gamma (Figure S5C and D).
To test if SPLASH finds diversity that may be missed by alignment-based methods, we analyzed a subset of 35 lemur B cells for which Ig light chain variable regions could not be assigned by the program BASIC38. BASIC uses curated human Ig sequences to assign V-D-J regions to Ig mRNAs; although it was able to assign the large majority of lemur B cells, there was a small subset for which it failed. Using a quite naive approach, we used SPLASH to find evidence of a light variable region in one of the 35 cells; we were able to reconstruct the full variable region from reads (Note S4). In two cells, there were hits to the surrogate light chain (IGLL1/IGLL5 or λ5), which associates with Ig heavy chain when there is not yet a rearranged light chain39. The additional yield from SPLASH was modest, but is proof-of-principle that SPLASH provides views on data distinct from traditional methods. In more recent work we have built on this capability (J. Salzman et al., unpublished work).
SPLASH applied to non-model organisms: octopus and eelgrass
To explore SPLASH’s generality, we applied it to two understudied organisms: octopus and eelgrass. Octopuses have the most complex sensory and nervous systems among invertebrates, and are unusual in having high levels of RNA editing40. The marine angiosperm Zostera marina, or eelgrass, is the most widely distributed seagrass, and its adaptation to varying conditions, especially in the face of climate change, is of great interest and only beginning to be explored at the genomic level41,42.
We focused narrowly on anchors where no more than one of its targets mapped to the respective reference (Methods), that is, where reference mapping could not have detected variation, to showcase what SPLASH can uncover that would be missed by conventional approaches. However, that means that in these analyses we ignored any interesting findings of SPLASH that are consistent with the reference.
For octopus, we analyzed an RNA-Seq dataset of Octopus bimaculoides43, encompassing a variety of tissues from a single individual (N. Bellono, personal communication). We examined several anchors with high effect sizes and BLAST hits to the closely related species Octopus sinensis (Methods; Data S6). An interesting SPLASH anchor was found in O. sinensis myosin-VIIa, known as MYO7A in humans; mutations in MYO7A cause Usher syndrome, leading to deafness and blindness44. Target 1 corresponds to the annotated first exon of O. sinensis myosin-VIIa, while target 2 represents a novel alternative first exon (not annotated in O. sinensis) that is expressed only in statocyst tissue (Figure 5A). Although neither target is part of the annotated O. bimaculoides myosin-VIIa gene, both target sequences are present shortly upstream of it in the genome; but the region containing the anchor is missing (Note S5). Incidentally, the O. sinensis myosin-VIIa gene also shows likely misassembly (Note S5). The statocyst-specific expression of an alternative first exon is intriguing given MYO7A’s association with Usher syndrome and deafness, as the octopus statocyst is a sensory organ for sound and balance45,46, suggesting homologous gene function.
Other SPLASH anchor-targets did not map to the O. bimaculoides genome or transcriptome yet did BLAST to 3’ UTRs of annotated transcripts in O. sinensis, including carboxypeptidase D, Upf2, and netrin receptor/DCC (Figure S6, Data S6). For all three, each target is expressed exclusively in some tissues and not others. Although O. bimaculoides has annotated transcripts for these genes, in all three cases the 3’ UTR region is missing or likely incorrect in the O. bimaculoides genome assembly. For two of the genes, the target variation may be due to differential expression of alleles: a 13 nt deletion in carboxypeptidase D, and a short CAG repeat polymorphism in nonsense-mediated decay gene Upf2 (“regulator of nonsense transcripts 2”) (Figure S6A and B). For netrin receptor, involved in axon guidance and apoptosis, the variation SPLASH detects could be allelic but is also consistent with A-to-I RNA editing (Figure S6C). Our focus here on non-mapping anchor-targets excluded many more readily interpretable examples of regulated variation; for example, other evidence for RNA editing in numerous octopus genes (data not shown).
We also applied SPLASH to RNA-Seq data from eelgrass (Zostera marina), collected in two locations, Montpellier, France (Mediterranean climate) and Rovika, Norway (near-arctic climate), in two seasons (winter and summer), and during day and night42. Considering each anchor with its most abundant target, a large number (14,680, 5.7%) did not align to eelgrass references (Data S7). A high-level view of these is provided by protein profiling of SPLASH anchor-targets: the top hits were chlorophyll A-B binding protein domain, Actin, Ubiquitin, and Silicon transporter (Figure 5B). protein BLAST of some of the translated sequences finds that these have their best hits to a variety of organisms other than eelgrass, notably diatoms. This was a surprise to us, but in fact it has long been known that eelgrass is extensively colonized by epiphytes47, of which diatoms predominate48 and may provide as much as 71–83% of the primary production by the community49. For the most enriched protein domain, we investigated an anchor with high effect size (0.66); its consensus matches to “fucoxanthin chlorophyll a/c protein” (FCP) in several diatoms, for example, Phaeodactylum tricornutum (95% amino acid identity, 81% nucleotide identity; Figure S7C); given that the matches are imperfect, the true species of origin may not be in the NCBI database. FCPs function as light-harvesting antennae for photosynthesis50. This anchor has several targets (differing in their amino acid sequence), whose abundance varies by the sampling location and time of year: target 1 is strongly predominant in France in June; targets 3, 4, 5, which share the same amino acid sequence, together dominate in France in December; and target 6 is predominant in Norway in December (Figure 5C). These targets could represent different diatom species or intra-species allelic variants. The abundance of this anchor (irrespective of target) is lower in Night samples (Figure S7C), indicating circadian regulation of this diatom photosynthetic gene. Other anchors mapping to diatoms, such as ferredoxin and high mobility group box-containing protein, also show targets that segregate by location, France vs. Norway (Figure S7A and B).
One of the top anchors, although mapping to eelgrass reference, does report on novel variation. It is in the NdhL subunit of chloroplast NADPH dehydrogenase complex (NDH). Of its four most abundant target sequences, three are within exon 3 and are SNP coding variants; the fourth represents retention of the intron following exon 3 (Figure 5D). Intron retention alters the second transmembrane segment and terminates translation soon after (Note S5). The intron retention variant (target 4) is highest in winter (Figure 5D): for Norway, December (green) vs June (red) samples completely segregate by target 4 expression; the France samples show overlap, but on average December (blue) is higher than June (yellow). The scatterplot of Figure 5D also illustrates some other patterns: Norway samples do not express target 2 (instead they express target 1 and 3; data not shown); France samples have either a high fraction of target 2, or moderate (the latter also express target 1; data not shown). NDH is involved in cyclic electron transport51 and modulation of NDH function may affect photosynthetic efficiency and oxidative stress under varying light conditions52.
The above work with octopus and eelgrass are just early forays, but show that the SPLASH approach can discover novel regulated RNA splicing and isoforms, and bring to light allelic variation and communities of associated organisms. SPLASH’s statistics-first and reference-free methodology is an essential basis, which can be augmented by protein profiling and wide use of the sequence database beyond solely genomic references.
Discussion
Genomic analysis today is performed with complex computer scientific workflows that are highly problem-specific. Here, we have shown that a single, simple statistics-first analytic framework unifies the approaches taken in disparate subfields of genomics and enables novel discovery missed by current workflows. We have achieved this by formalizing sample-dependent sequence variation in a single probabilistic model, and using this model to develop SPLASH. SPLASH frames prioritizes interpretable sequence variation as relations between sequence k-mers: anchors and targets; doing this required developing a new statistical test. While there is a history of using k-mers to detect biological variation, some statistical53, it was restricted to coarse-grained studies of genome composition.
Here, we provide a snapshot of SPLASH’s discoveries in very disparate subfields of genome science. When run on SARS-CoV-2 patient samples during the emergence of the Omicron variant, without strain metadata or reference genomes, SPLASH finds many anchors capturing strain defining mutations and emerging mutations, with good post-facto agreement with metadata. Using protein domain analysis, SPLASH finds that the spike protein is highly enriched for sequence variation, without using any reference sequences. This points to a broader potential for SPLASH in viral and other genomic surveillance, since emerging pathogens will likely have sequence variation missing from any reference.
In single-cell sequencing data, SPLASH is able to define differential expression between highly similar genes, including several different HLA genes and myosin light chains MYL12A and B. Like all the work presented here, SPLASH analysis is conducted in unsupervised mode; yet many of its significant anchors show cell-type regulation between macrophages and capillary endothelial cells. This testifies to the power of SPLASH’s unique statistical approach. In particular, HLA genes are highly polymorphic and traditionally difficult to analyze in genomics, yet have important associations to disease susceptibility. When applied to B and T cells of both human and mouse lemur, SPLASH automatically identifies antigen receptor genes, which are somatically assembled by V(D)J recombination, as exhibiting the most diverse variation. Lemur analysis was performed using only an approximate genomic reference (human) that diverged from lemur ~60 million years ago. SPLASH might also be used to shed new light on other polymorphic loci including non-coding RNAs, e.g. spliceosomal variants.
To showcase SPLASH’s wide applicability, we used it on datasets from very different organisms: octopus and eelgrass. We narrowly restricted our analysis to anchor-targets that did not map to the respective genome assemblies. In octopus, we identified several tissue-regulated transcripts, in particular one in Myo-VIIa that is only expressed in statocyst. In the eelgrass dataset, SPLASH uncovered that many of the sequences come from epiphytic diatoms, and have variation that correlates with geography and season. A broader analysis, allowing mapping to the genome (as done in the human examples), would almost certainly find many more examples of regulated expression in these understudied organisms. This highlights the enormous potential in already existing datasets, and the need for tools like SPLASH to better explore them.
SPLASH should be of general interest to most genomic analyses. Users should be able to easily run SPLASH on their own samples (FASTQ files): we provide it as a containerized Nextflow pipeline to minimize installation issues, and the default parameters should work well (SPLASH is robust to a range of parameters, Note S3); it can even run on a laptop (Note S6). SPLASH will provide a large output file of significant anchors and targets. What next? First is to realize that these results are a large data reduction and distillation of the variation present in the samples. There are many ways that they can be used. The typical steps are to attach as much biological meaning to them as possible. If metadata is available, it can be correlated with anchor-targets generated in unsupervised mode; alternatively, one can use metadata to supervise SPLASH analysis. If a reference genome is available, SPLASH can use it to align anchors and targets and provide gene names. SPLASH provides a number of metrics, such as p-value, effect size, target entropy, average target similarity (Hamming or Levenshtein distance) which can be used to tailor the anchor list. Another useful avenue is BLAST of anchor-targets or consensus sequences against the NCBI databases (both as nucleotide, and protein after in silico translation); this can be very helpful especially when there is no reference genome or it is incomplete. Also useful is protein domain profiling with databases like Pfam. Ultimately, users will bring their own area expertise to bear in deciding how to best utilize SPLASH results.
Even in areas where there are existing pipelines, for example in differential alternative splicing, or antigen receptor identification, SPLASH provides a very different paradigm and may well give additional insights. And where there are no good references, it may be the only feasible option. SPLASH scales to allow discovery to keep pace with the ever-increasing sequence data from the world at large, in particular microbes and metagenomic communities; recent collaborative work has further increased the computational efficiency of SPLASH54. The SPLASH paradigm is expansive. It could be applied to a wide range of “omics” modalities, from DNA and protein sequencing to Hi-C and spatial transcriptomics, and more (see Note S1 “Generality of SPLASH”). The statistical ideas underlying SPLASH are also expansive: anchor-target pairs can be generalized to tensors, and higher-dimensional relations between anchors, targets, and samples can be studied; other functions for splitting and hashing targets and samples can be considered, to optimize statistical power55.
In summary, SPLASH illustrates the power of statistics-first genomic analysis, where references and annotations are only optionally used for post-facto interpretation. References are valuable for interpretation of sequence data; however, the filtering of data by alignment to references introduces statistical bias in quantification, blindspots and errors. SPLASH re-directs from the “reference-first” approach to “statistics-first”, performing statistical hypothesis tests on raw sequencing data. By this design, SPLASH is also highly efficient. SPLASH promises data-driven biological study with scope and power previously impossible.
Limitations of the study
SPLASH can be applied to problems across diverse fields which are of great current importance (Note S1), including those previously discussed. Naturally, some problems are not directly amenable to SPLASH analysis as formulated here. The most clear are cases where quantification of sample-specific RNA or DNA abundance alone is desired (e.g., differential gene expression analysis). Additionally, SPLASH is currently unable to distinguish which biological mechanism underlies the called variation, and work in progress seeks to address this.
Materials and Methods
SPLASH overview
Full details of running SPLASH and its outputs can be found at https://github.com/salzman-lab/nomad. Briefly, it takes as inputs a set of FASTQ sequencing data files (one per sample). There are various parameter choices, but it should work well with the default settings. The standard output includes a table of anchors, targets, p-values, etc., a table of “consensus” sequences (see below), and a table of “element annotations” (see below). SPLASH can optionally also perform alignments with bowtie2 and STAR, to generate “genome annotations” and splice junction annotations (see below).
SPLASH anchor preprocessing and parameter choices
Anchors and targets are defined as contiguous subsequences of length k positioned at a distance R = max(0, (L - 2 * k) / 2) apart, where L is the length of the first read processed in the dataset, and R is rounded to the nearest integer. If L = 100 and k = 27, then R = 23. We note that for a fixed number of anchor-target pairs, under alternatives such as differential exon skipping, larger choices of R have provably higher power than smaller choices, following the style of analysis in57. k = 27 is typically long enough to be assigned a unique position in a genome while having a low probability of containing a sequencing error. Anchor sequences can be extracted as adjacent, disjoint sequences or as tiled sequences that begin at a fixed step size, to reduce computational burden. For this manuscript, SPLASH was run with default parameters: with 1M reads per FASTQ file, anchor sequences tiled by 5 bp, and k = 27. For HLCA datasets, both read 1 and read 2 were used; for other datasets, only read 1. Extracted anchor and target sequences are then counted for each sample with the UNIX command, ‘sort | uniq -c’, and anchor-target counts are then collected across all samples for restratification by the anchor sequence. This stratification step allows for user control over parallelization. To reduce the number of hypotheses tested and required to correct for, we discard anchors that have only one unique target, anchors that appear in only 1 sample, and (anchor, sample) pairs that have fewer than 6 counts. Then, we retain only anchors having more than 30 total counts after the above thresholds were applied. This approach efficiently constructs sample by target counts tables for each anchor.
SPLASH p-values
SPLASH’s analysis is based on a new statistical method for analyzing contingency tables that provides p-values to reject the null hypothesis that row (target) counts are drawn from the same distribution for all samples. These p-values are closed form, meaning that no permutation, resampling or numerical solutions to complex likelihoods are required for significance tests.
We develop a test statistic S that has power to detect sample-dependent sequence variation and is designed to have low power when there are a few outlying samples with low counts as follows. First, we randomly construct a function f, which maps each target independently to {0,1}. We then compute the mean value of targets with respect to this function. Next, we compute the mean within each sample of this function. Then, an anchor-sample score is constructed for sample j, Sj, as a scaled version of the difference between these two. Finally, the test statistic S is computed as the weighted sum of these Sj, with weights cj (which denote class-identity in the two-group case with metadata and are chosen randomly without metadata, see below). In the below equations, Dj,k denotes the sequence of the k-th target observed for the j-th sample, and M denotes the total number of observations of this anchor.
Statistically valid p-value bounds (non-asymptotic, providing control of Type I error for finite number of observations) are computed as:
by applying Hoeffding’s inequality on these sums of independent random variables (under the null). The derivation is detailed in the Note S2.
This statistic is computed for K different random choices of f, and in the case where sample group metadata is not used or available, jointly for each of the L random choices of c, here with K = 10 and L = 50. We call the random choice of cj’s “random c’s” below. The choice of f and c that minimize the p-value are reported, and are used for computing the p-value of this anchor. To yield valid p-values we apply Bonferroni correction over the L×K multiple hypotheses tested (just K when sample metadata is used and randomization on c is not performed). Then, to determine the significant anchors, we apply Benjamini-Yekutieli (BY) correction58 to the list of p-values (for each anchor), yielding valid FDR controlled q-values reported throughout the manuscript implemented with the statsmodels.api.stats.multipletests functionality in python.
SPLASH effect size
SPLASH provides a measure of effect size when the cj’s used are ±1, to allow for prioritization of anchors with large inter-sample differences in target distributions. Effect size is calculated based on the split c and function f that yield the most significant SPLASH p-value. Fixing these, the effect size is the absolute value of the difference between the mean function value over targets (with respect to f) across those samples with cj = +1 denoted A+, and the mean over targets (with respect to f) across those samples with cj = −1 denoted A−.
This effect size has natural relations to a simple two group alternative hypothesis. It can also be shown to relate to the total variation distance between the empirical target distributions of the two groups. These connections are discussed further in Note S2.
Consensus sequences
A consensus sequence is built for each significant anchor for the sequence downstream of the anchor sample. A separate consensus is built for each sample by aggregating all reads from this sample that contain the given anchor. Then, SPLASH constructs the consensus as the plurality vote of all these reads; concretely, the consensus at base pair i is the plurality vote of all reads that contain the anchor, i base pairs after the anchor appears in the read (a read does not vote for consensus base i if it has terminated within i base pairs after the anchor appeared). The consensus base as well as the fraction agreement with this base among the reads is recorded. Some empirical behavior of consensuses is shown in Figure S4.
The consensus sequences can be used for both splice site discovery and other applications, such as identifying point mutations and highly diversifying sequences, e.g. V(D)J rearrangements. The statistical properties of consensus building make it an appealing candidate for use in short read sequencing59, and may have information theoretic justification in de novo assembly60 (Note S1).
To provide intuition regarding the error correcting capabilities of the consensus, consider a simple probabilistic model where our reads from a sample all come from the same underlying sequence. In this case, under the substitution only error model, we have that the probability that our consensus for n reads makes a mistake at a given location i under independent sequencing error rate ϵ (substitution only) is at most
We can see that even for n = 10, this probability is less than 1.3E-7 for a given base pair, which we can union-bound over the length of the consensus to yield a vanishingly small probability of error. Thus, for a properly aligned read, if a base pair differs between the consensus and reference it is almost certainly a SNP.
Element annotations
To identify false positive sequences or contextualize mobile genetic elements, anchors and targets are aligned with bowtie2 to a set of indices, corresponding to databases of sequencing artifacts, transposable elements, and mobile genetic elements61. In these alignments, using bowtie2, the best hit is reported, relative to an order of priority. The references used are: UniVec, Illumina adapters, grass carp (Ctenopharyngodon idella genome, GCA_019924925)62, Escherichia phage phiX174, Rfam63, Dfam64, TnCentral65, ACLAME66, ICEberg67, CRISPR direct repeats68, ITSoneDB69, ITS270; and also the reference genome of interest for the study. (Grass carp was used as a control as it contains many artifactual Illumina adapters.) To perform these annotations, bowtie2 indices were built from the respective reference FASTAs, using bowtie2-build with default parameters. Anchors and targets were then aligned to each index, using bowtie2-align with default parameters. For each sequence, we report the alignment to the reference and the position of that alignment for each reference in the prespecified set. Anchors and targets, and their respective element annotations, are reported in the element annotation summary files.
Genome annotations
Anchor, target, and consensus sequences can be aligned by SPLASH to reference genomes and transcriptomes, to provide information about the location of sequences relative to genomic elements. All alignments reported are run in two modes in parallel: bowtie2 end-to-end mode (the bowtie2 default parameters) and bowtie2 local mode (`-local`, in addition to the bowtie2 default parameters). To report alignments to the transcriptome, the sequences are aligned to the reference transcriptome with bowtie2, with `-k 1`, in addition to the above parameters, to report a maximum of one alignment per sequence. If there is a transcriptome alignment, we report the alignment to the reference and the MAPQ score of the alignment. To report alignments to the genome, the sequences are aligned to the reference genome, with the same parameters above. If there is a genome alignment, we report the alignment to the reference, the strand of the alignment, and the alignment MAPQ score.
Splice junction calls
To identify exon coordinates for reporting annotations in this manuscript, consensus sequences are mapped with STAR aligner (default settings)71. Gapped alignments are extracted and their coordinates are annotated with known splice junction coordinates using ‘bedtools bamtobed --split’; each resulting contiguously mapping segment is called a “called exon”. From each consensus sequence, called exons are generated as start and end sites of each contiguously mapped sequence in the spliced alignment. These ‘called exons’ are then stratified as start sites and end sites. Note that the extremal positions of all called exons would not be expected to coincide with a splice boundary; “called exon” boundaries would coincide with an exon boundary if they are completely internal to the set of called exon coordinates. Each start and end site of each called exon is intersected with an annotation file of known exon coordinates; it receives a value of 0 if the site is annotated, and 1 if it is annotated as alternative. The original consensus sequence and the reported alignment of the consensus sequence are also reported. Gene names for each consensus are assigned by bedtools intersect with gene annotations (hg38 RefSeq for human data by default), possibly resulting in multiple gene names per consensus.
SPLASH protein domain profiles
Custom scripts were used to generate protein domain profiles. For each set of enriched anchors, homology-based annotation was attempted against an annotated protein database, Pfam14. For each dataset, up to 1000 of the most significant anchors (q-value < 0.01) were retained for the following analysis: we first generated a substring of each downstream consensus by appending each consensus nucleotide assuming both conditions were met: a minimum observation count of 10 and a minimum agreement fraction of 0.8, until whichever metric first exhibited two consecutive failures at which point no further nucleotide was added. A limit of 1000 anchors was used due to computational constraints from HMMer3 (see below). Anchors that did not have any consensus nucleotides appended were kept as is. An extended anchor was generated for each experiment in which an anchor was found. Each extended anchor was then stored in a final concatenated multi FASTA file with unique seqID headers for each experiment’s extended anchors.
To assess these extended anchors for protein homology, this concatenated FASTA file was then translated in all six frames with the standard translation table using seqkit72 prior to using hmmsearch from the HMMer3 package73 to assess resulting amino acid sequences against the Pfam35 profile Hidden Markov Model (pHMM) database. The resulting ‘raw’ .tblout outputs were then processed, keeping the best hit (based on E-value) per each initial anchor, and any hits with an E-value better than 0.01 were parsed into an *_nomad.Pfam (or *_control.Pfam) file used for subsequent plotting.
All hits to the Pfam database were then binned at different E-value orders of magnitude and plotted. In each case, control assessments were performed by repeating the extension and homology searches against an equivalent number of control anchors (see below). The number of matched anchors used for SPLASH and control analysis per dataset were as follows: 201 high effect size (.5) anchors in SARS-CoV-2 from South Africa, 252 high effect size (.5) anchors in SARS-CoV-2 from France; 1000 anchors (no effect size filter) were used for rotavirus, human T cells, human B cells, Microcebus natural killer T cells, and Microcebus B cells. We note that while the number of input anchors for SPLASH and control sets are matched, it is possible to have more control protein domains in the resulting barplots, as only high E-value hits to Pfam are reported in the visualizations. Domain profiling summaries are in Data S1.
A hypergeometric test was used to give p-values for protein domain analysis. For a given domain, we construct the 2×2 contingency table, where the first row is the number of SPLASH hits for this domain, followed by the total number of SPLASH hits not in this domain. The second row is the mirror of this for control, where the first entry is the number of control hits for this domain, followed by the total number of control hits not in this domain. A one-sided p-value is computed using Fisher’s exact test, which is identically a hypergeometric test. We apply Bonferroni correction for the total number of protein domains expressed by either SPLASH or control, to yield the stated p-values.
Lastly it is worth noting that while only counts of the best scoring Pfam hits were assessed in this study, other information is also produced by HMMer3. In particular, relative alignment positions are given for each hit which could be used to more finely pinpoint the precise locus at which sequence variation is detected.
Control analyses
To construct control anchor lists based on abundance, we considered all anchors input to SPLASH and counted their abundance, collapsing counts across targets. That is, an anchor receives a count determined by the number of times it appears at an offset of 5 in the read up to position R - max(0,R/2–2*k) where R is the length of the read, summed over all targets. The 1000 most abundant anchors were output as the control set. For analysis comparing control to SPLASH anchors, min(|SPLASH anchor list|, 1000) most abundant anchors from the control set were used and the same number of SPLASH anchors were used, sorted by p-value.
Generation of contingency table heatmaps.
To plot the anchor-target heatmaps, we exclude targets with low counts. Concretely, we by default filter out targets that occur fewer than 5 times, have less than 5% of the total counts of that anchor, and retain at most the top 10 targets, while ensuring that at least 2 targets are plotted. Then, all samples with fewer than 5 counts are discarded. For clarity of presentation, we include or remove rows corresponding to additional targets based on biological relevance.
SARS-CoV-2 analysis
SARS-CoV-2 data was downloaded from the NCBI: France9 (SRP365166) and South Africa8 (SRP348159). Sample metadata for the France dataset was provided by the authors via personal communication, and consists of their calls of the primary and secondary infecting strains for each patient sample. We note that sample ‘WTA-022002271301_S1’ appeared to be mislabelled, appearing in the metadata file but not in the NCBI sample list. Conversely, the sample ‘Pl924–022002271301_S1472’ appears in the NCBI sample list, but not in the metadata file. Thus, we associate these labels to each other, to obtain metadata labeling for all 106 samples. We do not have information regarding which samples are replicates. We provide the NCBI sample list and the strain metadata file as Data S8.
The SARS-CoV-2 datasets used in this manuscript were analyzed with SPLASH’s unsupervised mode (no sample metadata provided). To identify high effect size anchors, a threshold of èffect_size_randCjs` > 0.5 was used (table S2).
For the purposes of strain-defining mutation analysis, we manually constructed “archetype” genome sequences for variant strains Delta, Omicron BA.1, and Omicron BA.2 by editing the Original (Wuhan) reference NC_045512.2 to contain all (and only) the defining mutations specified at CoVariants.org10; these are provided as Data S5.
To determine what SPLASH calls (and control anchors) were strain defining we perform the following. To generate SPLASH’s calls, we filter for anchors that are significant (with a BY corrected p-value less than .05) and have large effect size (> .5), yielding a list of N SPLASH-called anchors. Control anchors are generated by taking the N anchors with the highest counts. For each of these anchors we construct their target × sample contingency table, first filtering out all anchors with fewer than 30 counts, only 1 unique target, or only 1 unique sample, and filtering out all samples with 5 or fewer counts. Then, we discard all targets that constitute less than 5% of the remaining counts for that anchor. The remaining anchors and targets are then bowtie aligned to an index comprised of the Wuhan reference genome (NC_045512.2), and Delta, Omicron BA.1, and Omicron BA.2 archetype genomes. For this alignment, options ‘-a -v 0’ were used. Then, for each set of anchors (SPLASH calls, and controls), the list is filtered to only anchors that align perfectly to at least one of the reference assemblies, further requiring that each anchor have at least one target that aligns perfectly to a reference assembly. Then for each anchor, we declare it to be strain defining if, for any of the reference assemblies, it has at least one target that maps to it and one target that does not.
Identifying cell-type specific isoforms in single-cell data (lung macrophages and capillary cells)
The human lung scRNA-seq data used here (HLCA SS2)20 is accessible through the European Genome-phenome Archive (accession number: EGAS00001004344); FASTQ files from donor 1 (P2) and donor 2 (P3) generated with the Smart-seq2 protocol were used. In the analysis of HLCA SS2 data, we utilize “isoform detection conditions” for alternative isoform detection. These conditions select for (anchor, target) pairs that map exclusively to the human genome, anchors with at least one split-mapping consensus sequence, mu_lev > 5, and M > 100. mu_lev is defined as the average target distance from the most abundant target as measured by Levenstein distance. M is defined as the total number of counts in the anchor’s contingency table. To identify anchors and targets that map exclusively to the human genome, we included anchors and targets that had exactly one element annotation, where that one element annotation must be grch38_1kgmaj. To identify anchors with at least one split-mapping consensus, we selected anchors that had at least one consensus sequence with at least 2 called exons. The conditions on Levenshtein distance, designed to require significant across-target sequence variation, significantly reduced anchors analyzed (excluding many SNP-like effects). We further restricted to anchors with M > 100, to account for the lower numbers in macrophage cells; note that the user can choose to use a lower M requirement, based on input data. These isoform detection parameters were used to identify the SS2 examples discussed in this manuscript, MYL12. For HLA discussion, gene names were called using consensus_gene_mode.
While here we focus on anchors that have aligned to the human genome, we note in passing that SPLASH makes many predictions of cell-type specific RNA expression that include sequences that map to repetitive elements or do not map to the human reference: for individual P2 (respectively P3), 53% (61%), of 4010 (4603) anchors map to the human genome and no other reference; 35% (30%) map to both the Rfam and human genome; 6% (7%) have no map to any reference used for annotation which includes repetitive and mobile elements. As an example, 9 and 18 such anchors (donor 1 and 2 respectively) BLAST to MHC alleles in the NCBI database.
Immune single-cell analysis
To study human B and T cells, we utilize Tabula Sapiens data (Smart-seq2)34, downloaded from https://tabula-sapiens-portal.ds.czbiohub.org/; B cells were used from donor 1 (TSP1) and CD4+ T cells from donor 2 (TSP2). Mouse lemur single-cell RNA-seq data used in this study was generated as part of the Tabula Microcebus consortium36; the FASTQ files were downloaded from https://tabula-microcebus.ds.czbiohub.org. B cells and natural killer T cells were analyzed separately; both were from spleen and were a mixture of individual L2 and L4. To determine the most frequent transcriptome annotation for a dataset, all significant anchors were mapped to the human transcriptome (GRCh38, Gencode) with bowtie2, using default parameters and `-k 1` to report at most one alignment per anchor (Data S4). Then, the bowtie2 transcript hits are aggregated by counting over anchors. The transcript hits with the highest counts over all anchors were reported. Protein domain profiling was performed as described above.
SPLASH comparison to BASIC analysis in lemur spleen B cells
As part of the Tabula Microcebus consortium36, mouse lemur B cells were annotated with BASIC38 to identify Ig variable domains. However, BASIC was unable to identify the light chain variable domain in 35 cells. We used a simple approach to see if SPLASH could identify variable domains in these uncalled B cells. We checked the SPLASH genome annotations for these cells for anchors mapping to human “IGL” or “IGK” genes; there were only five such anchors, all to IGL, and these were found in only eight cells. For those eight cells, we retrieved the SPLASH consensus sequences for these anchors, which ranged from 2–5 per cell. Where consensuses for a cell overlapped, one was chosen, and these were submitted to BLAST against the nr/nt database. Many hits were to “immunoglobulin lambda-like polypeptide” 1 or 5 (IGLL), surrogate light chain genes that contain sequence similar to lambda J and C regions (as well as a unique N-terminal region) and so could mimic alignment to a true lambda variable domain. Therefore, BLAST alignments were checked to see whether the match could be assigned to V, J, C, or IGLL-unique regions. 4 cells matched C-region, 2 matched IGLL-unique region, and 2 had sequence beyond J-region (presumed V-region). For the latter two, we attempted to extend the consensus further into the V-region by `grep` in raw reads; one could not be extended as it only had adaptors adjacent to J-region sequence. The other consensus was extended through the full V-region, and its sequence is given in Note S4, along with the IGLL-unique matches.
SPLASH for Zostera marina (eelgrass) and Octopus bimaculoides
Data was downloaded from SRP327909 (eelgrass42), and SRP278619 (Octopus43), using nf-core fetchngs run in default mode74 and preprocessed with FASTP75 run in default mode to mitigate false positive calls due to adapter concatenation. An updated version of SPLASH54 was run with gap length=0, anchor_unique_targets_threshold=1, anchor_count_threshold=50, anchor_samples_threshold=1, anchor_sample_counts_threshold=5, and excluding anchor-targets containing poly A / C / G / T run of length 8. 500 pairs of random c and f were chosen. Fastp v0.3.9 was installed on 2/23/23 using bioconda.
Anchor-targets were selected from the SPLASH calls if they had homopolymer length ≤ 5, effect size > 0.1, and corrected p-value < 0.01. Element annotations were run using SPLASH commit ID 728066b; anchor-targets mapping to UniVec, Illumina adapters, SARS-CoV-2, or grass carp were removed. Anchor-targets were aligned with STAR 2.7.5 to a reference index generated from either the O. bimaculoides reference genome76 (NC_068981.1) and transcriptome43; or the Z. marina nuclear genome77 (v3.1, https://phytozome-next.jgi.doe.gov/info/Zmarina_v3_1) and mitochondrial and chloroplast genomes78 (NC_035345.1 and NC_036014.1, respectively).
For protein domain profiling, anchor-targets with no element annotation were in silico translated in all six frames and submitted to HMMer search of the Pfam database. For Figure 5B (eelgrass), anchors were ordered by descending number of observations; the top 200,000 were concatenated with their anchor-targets and submitted to Pfam; these anchor-targets are defined as controls. The best Pfam hit by full sequence E-value is retained for each anchor-target. For each domain, the top 4 domains hit by anchor-targets and by controls are plotted, where bar height is the number of anchor-targets that hit the domain with a full-sequence E-value < 0.01.
For BLAST analysis, we selected anchors with no more than 1 target mapping with either STAR to the reference genome or Bowtie2 element annotations, thus cases where no sample-specific variation would be detected if a reference genome were used. In eelgrass, targets were selected if their fraction exceeded 0.5 and their anchor’s effect size exceeded 0.9; in octopus, targets were selected if their fraction exceeded 0.3 and if their anchor’s effect size exceeded 0.8. The 1808 anchor-target pairs satisfying these criteria in octopus were submitted to BLAST with parameters:
-db nt -evalue 0.1 -task blastn -dust no -word_size 24 -reward 1 -penalty −3 -max_target_seqs 4
BLAST hits were merged into SPLASH output, with an indicator variable for whether the sequence was queried. For octopus anchor-targets, 1061/1808 had a BLAST hit (max E-value 0.028). There were 288 hits to octopus, of which 281 were annotated as from O. sinensis and 7 from O. bimaculoides. Selected sequences mapping to O. sinesis were further analyzed. For eelgrass, 1606/4081 had a BLAST hit (max E-value 0.028).
SPLASH output, merged with Pfam and BLAST analyses, are in Data S6 and Data S7 (octopus and eelgrass, respectively).
Supplementary Material
Acknowledgements
We thank Elisabeth Meyer; Arjun Rustagi for extensive discussion and assistance in interpretation of viral biology; Roozbeh Dehghannasiri for selecting and collating data from HLCA/TSP/Tabula Microcebus analysis; the Tabula Microcebus Consortium (Camille Ezran and Hannah Frank) for data sharing and discussion of B and T cell receptor detection algorithms; Jessica Klein for extensive figure assistance; Michael Swift for discussion of B and T cell receptor detection algorithms; Julia Olivieri for feedback on presentation and detailed comments on the manuscript; Andy Fire for useful discussions and feedback on the manuscript; Robert Bierman for figure assistance and laptop timing; Sebastian Deorwicz and Marek Kokot for early access to code, and Aaron Straight, Catherine Blish, Peter Kim, Alex Starr and all members of the Salzman lab for feedback and comments.
Funding
J.S. is supported by the Stanford University Discovery Innovation Award, National Institute of General Medical Sciences grant nos. R35 GM139517 and the National Science Foundation Faculty Early Career Development Program Award no. MCB1552196. T.Z.B. is supported by the National Science Foundation Graduate Research Fellowship Program and the Stanford Graduate Fellowship.
Funding Statement
J.S. is supported by the Stanford University Discovery Innovation Award, National Institute of General Medical Sciences grant nos. R35 GM139517 and the National Science Foundation Faculty Early Career Development Program Award no. MCB1552196. T.Z.B. is supported by the National Science Foundation Graduate Research Fellowship Program and the Stanford Graduate Fellowship.
Footnotes
Competing interests
K.C., T.Z.B. and J.S. are inventors on provisional patents related to this work. The authors declare no other competing interests.
Open-source figure attributions
Person graphic, by Tanguy Krl: https://thenounproject.com/icon/person-1218528
Virus graphic, by Nuttapon Pohnprompratahn: https://thenounproject.com/icon/virus-2198681.
Gears graphic, by Tresnatiq: https://thenounproject.com/icon/gears-1088494.
Clock graphic, by sudan: https://thenounproject.com/icon/clock-5677937.
Book graphic, by Arjan: https://thenounproject.com/icon/open-book-1361747.
SARS-CoV-2 virion, by Centers for Disease Control and Prevention (CDC): https://commons.wikimedia.org/wiki/File:SARS-CoV-2_without_background.png
Octopus bimaculatus, by United States Fish Commission (1910): https://commons.wikimedia.org/wiki/File:Octopus_bimaculatus1.jpg
Diatom, by George Swann: https://commons.wikimedia.org/wiki/File:Diatom_4.png
Zostera marina, by Carl Axel Magnus Lindman: https://commons.wikimedia.org/wiki/File:491_Zostera_marina.jpg
Data and materials availability
The code used in this work is available as a fully-containerized Nextflow pipeline56 at https://github.com/salzmanlab/nomad, commit 1b73949.
The human lung scRNA-seq data used here is accessible through the European Genome-phenome Archive (accession number: EGAS00001004344); FASTQ files from donor 1 and donor 2 were used. The FASTQ files for the Tabula Sapiens data (Smart-seq2) were downloaded from https://tabula-sapiens-portal.ds.czbiohub.org/; B cells were used from donor 1 (TSP1) and T cells from donor 2 (TSP2). The mouse lemur single-cell RNA-seq data used in this study was generated as part of the Tabula Microcebus consortium; the FASTQ files were downloaded from https://tabula-microcebus.ds.czbiohub.org. Viral data was downloaded from the NCBI: SARS-CoV-2 from France (SRP365166), SARS-CoV-2 from South Africa (SRP348159), and rotavirus (SRP328899).
The sample sheets used as pipeline input, including individual sample SRA accession numbers, for all analyses are uploaded to pipeline GitHub repository. Similarly, scripts to perform supplemental analysis can be found on the pipeline repository.
FASTP v0.23.2 was installed on 2/15/23 using bioconda. R-SPLASH v0.3.9 was installed on 2/23/23, commit 5dafdc8.
References
- 1.Sherman R.M., Forman J., Antonescu V., Puiu D., Daya M., Rafaels N., Boorgula M.P., Chavan S., Vergara C., Ortega V.E., et al. (2019). Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat. Genet. 51, 30–35. 10.1038/s41588-018-0273-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Domingo E., and Perales C. (2019). Viral quasispecies. PLoS Genet. 15, e1008271. 10.1371/journal.pgen.1008271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Tettelin H., and Medini D. eds. (2020). The Pangenome. [PubMed]
- 4.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., Vollger M.R., Altemose N., Uralsky L., Gershman A., et al. (2022). The complete sequence of a human genome. Science 376, 44–53. 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Romano Y., Sesia M., and Candès E. (2019). Deep Knockoffs. J. Am. Stat. Assoc., 1–27. 10.1080/01621459.2019.1660174.34012183 [DOI] [Google Scholar]
- 6.Chung E., and Romano J.P. (2013). Exact and asymptotically robust permutation tests. Ann. Statist. 41, 484–507. 10.1214/13-AOS1090. [DOI] [Google Scholar]
- 7.Castel S.E., Levy-Moonshine A., Mohammadi P., Banks E., and Lappalainen T. (2015). Tools and best practices for data processing in allelic expression analysis. Genome Biol. 16, 195. 10.1186/s13059-015-0762-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Viana R., Moyo S., Amoako D.G., Tegally H., Scheepers C., Althaus C.L., Anyaneji U.J., Bester P.A., Boni M.F., Chand M., et al. (2022). Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa. Nature 603, 679–686. 10.1038/s41586-022-04411-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Bal A., Simon B., Destras G., Chalvignac R., Semanas Q., Oblette A., Quéromès G., Fanget R., Regue H., Morfin F., et al. (2022). Detection and prevalence of SARS-CoV-2 co-infections during the Omicron variant circulation in France. Nat. Commun. 13, 6316. 10.1038/s41467-022-33910-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Hodcroft E.B. (2021). CoVariants: SARS-CoV-2 Mutations and Variants of Interest. https://covariants.org/faq#what-are-defining-mutations.
- 11.Hodcroft E.B., Domman D.B., Snyder D.J., Oguntuyo K.Y., Van Diest M., Densmore K.H., Schwalm K.C., Femling J., Carroll J.L., Scott R.S., et al. (2021). Emergence in late 2020 of multiple lineages of SARS-CoV-2 Spike protein variants affecting amino acid position 677. medRxiv. 10.1101/2021.02.12.21251658. [DOI] [Google Scholar]
- 12.Colson P., Delerce J., Burel E., Beye M., Fournier P.-E., Levasseur A., Lagier J.-C., and Raoult D. (2022). Occurrence of a substitution or deletion of SARS-CoV-2 spike amino acid 677 in various lineages in Marseille, France. Virus Genes 58, 53–58. 10.1007/s11262-021-01877-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Zeng C., Evans J.P., Faraone J.N., Qu P., Zheng Y.-M., Saif L., Oltz E.M., Lozanski G., Gumina R.J., and Liu S.-L. (2021). Neutralization of SARS-CoV-2 Variants of Concern Harboring Q677H. MBio 12, e0251021. 10.1128/mBio.02510-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., Tosatto S.C.E., Paladin L., Raj S., Richardson L.J., et al. (2021). Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412–D419. 10.1093/nar/gkaa913. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Qi H., Liu B., Wang X., and Zhang L. (2022). The humoral response and antibodies against SARS-CoV-2 infection. Nat. Immunol. 23, 1008–1020. 10.1038/s41590-022-01248-5. [DOI] [PubMed] [Google Scholar]
- 16.Voss W.N., Hou Y.J., Johnson N.V., Delidakis G., Kim J.E., Javanmardi K., Horton A.P., Bartzoka F., Paresi C.J., Tanno Y., et al. (2021). Prevalent, protective, and convergent IgG recognition of SARS-CoV-2 non-RBD spike epitopes. Science 372, 1108–1112. 10.1126/science.abg5268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Simsek C., Bloemen M., Jansen D., Beller L., Descheemaeker P., Reynders M., Van Ranst M., and Matthijnssens J. (2021). High prevalence of coinfecting enteropathogens in suspected rotavirus vaccine breakthrough cases. J. Clin. Microbiol. 59, e0123621. 10.1128/JCM.01236-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Antia A., Pinski A.N., and Ding S. (2022). Re-Examining Rotavirus Innate Immune Evasion: Potential Applications of the Reverse Genetics System. MBio 13, e0130822. 10.1128/mbio.01308-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Picelli S., Faridani O.R., Björklund A.K., Winberg G., Sagasser S., and Sandberg R. (2014). Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181. 10.1038/nprot.2014.006. [DOI] [PubMed] [Google Scholar]
- 20.Travaglini K.J., Nabhan A.N., Penland L., Sinha R., Gillich A., Sit R.V., Chang S., Conley S.D., Mori Y., Seita J., et al. (2019). A molecular cell atlas of the human lung from single cell RNA sequencing. BioRxiv. 10.1101/742320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Olivieri J.E., Dehghannasiri R., Wang P.L., Jang S., de Morree A., Tan S.Y., Ming J., Ruohao Wu A., Tabula Sapiens Consortium, Quake S.R., et al. (2021). RNA splicing programs define tissue compartments and cell types at single-cell resolution. eLife 10. 10.7554/eLife.70692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Grant J.W., Taubman M.B., Church S.L., Johnson R.L., and Nadal-Ginard B. (1990). Mammalian nonsarcomeric myosin regulatory light chains are encoded by two differentially regulated and linked genes. J. Cell Biol. 111, 1127–1135. 10.1083/jcb.111.3.1127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Vedula P., Kurosaka S., Leu N.A., Wolf Y.I., Shabalina S.A., Wang J., Sterling S., Dong D.W., and Kashina A. (2017). Diverse functions of homologous actin isoforms are defined by their nucleotide, rather than their amino acid sequence. eLife 6. 10.7554/eLife.31661. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Trowsdale J., and Knight J.C. (2013). Major histocompatibility complex genomics and human disease. Annu. Rev. Genomics Hum. Genet. 14, 301–323. 10.1146/annurev-genom-091212-153455. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Daar A.S., Fuggle S.V., Fabre J.W., Ting A., and Morris P.J. (1984). The detailed distribution of MHC Class II antigens in normal human organs. Transplantation 38, 293–298. 10.1097/00007890-198409000-00019. [DOI] [PubMed] [Google Scholar]
- 26.Pober J.S., Merola J., Liu R., and Manes T.D. (2017). Antigen presentation by vascular cells. Front. Immunol. 8, 1907. 10.3389/fimmu.2017.01907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Andersson G., Svensson A.C., Setterblad N., and Rask L. (1998). Retroelements in the human MHC class II region. Trends Genet. 14, 109–114. 10.1016/s0168-9525(97)01359-0. [DOI] [PubMed] [Google Scholar]
- 28.hla.alleles.org (2023). Numbers of HLA Alleles. HLA Informatics Group, Anthony Nolan Research Institute. https://hla.alleles.org/nomenclature/stats.html. [Google Scholar]
- 29.Tian R., Zhu H., Pang Z., Tian Y., and Liang C. (2019). Extraordinary diversity of HLA class I gene expression in single cells contribute to the plasticity and adaptability of human immune system. BioRxiv. 10.1101/725119. [DOI] [Google Scholar]
- 30.Gutierrez-Arcelus M., Baglaenko Y., Arora J., Hannes S., Luo Y., Amariuta T., Teslovich N., Rao D.A., Ermann J., Jonsson A.H., et al. (2020). Allele-specific expression changes dynamically during T cell activation in HLA and other autoimmune loci. Nat. Genet. 52, 247–253. 10.1038/s41588-020-0579-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Kang J.B., Shen A.Z., Sakaue S., Luo Y., Gurajala S., Nathan A., Rumker L., Aguiar V.R.C., Valencia C., Lagattuta K., et al. (2023). Mapping the dynamic genetic regulatory architecture of HLA genes at single-cell resolution. medRxiv. 10.1101/2023.03.14.23287257. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Briney B., Inderbitzin A., Joyce C., and Burton D.R. (2019). Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397. 10.1038/s41586-019-0879-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Teraguchi S., Saputri D.S., Llamas-Covarrubias M.A., Davila A., Diez D., Nazlica S.A., Rozewicki J., Ismanto H.S., Wilamowski J., Xie J., et al. (2020). Methods for sequence and structural analysis of B and T cell receptor repertoires. Comput. Struct. Biotechnol. J. 18, 2000–2011. 10.1016/j.csbj.2020.07.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Tabula Sapiens Consortium*, Jones R.C., Karkanias J., Krasnow M.A., Pisco A.O., Quake S.R., Salzman J., Yosef N., Bulthaup B., Brown P., et al. (2022). The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896. 10.1126/science.abl4896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ezran C., Karanewsky C.J., Pendleton J.L., Sholtz A., Krasnow M.R., Willick J., Razafindrakoto A., Zohdy S., Albertelli M.A., and Krasnow M.A. (2017). The mouse lemur, a genetic model organism for primate biology, behavior, and health. Genetics 206, 651–664. 10.1534/genetics.116.199448. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.The Tabula Microcebus Consortium, Ezran C., Liu S., Chang S., Ming J., Botvinnik O., Penland L., Tarashansky A., de Morree A., Travaglini K.J., et al. (2021). Tabula Microcebus: A transcriptomic cell atlas of mouse lemur, an emerging primate model organism. BioRxiv. 10.1101/2021.12.12.469460. [DOI] [Google Scholar]
- 37.Heller N.M., Berga-Bolanos R., Naler L., and Sen J.M. (2018). Natural killer T (NKT) cells in mice and men. In Signaling mechanisms regulating T cell diversity and function, Soboloff J. and Kappes D. J., eds. (CRC Press/Taylor & Francis; ). 10.1201/9781315371689-8. [DOI] [PubMed] [Google Scholar]
- 38.Canzar S., Neu K.E., Tang Q., Wilson P.C., and Khan A.A. (2017). BASIC: BCR assembly from single cells. Bioinformatics 33, 425–427. 10.1093/bioinformatics/btw631. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Übelhart R., Werner M., and Jumaa H. (2016). Assembly and Function of the Precursor B-Cell Receptor. Curr. Top. Microbiol. Immunol. 393, 3–25. 10.1007/82_2015_475. [DOI] [PubMed] [Google Scholar]
- 40.Liscovitch-Brauer N., Alon S., Porath H.T., Elstein B., Unger R., Ziv T., Admon A., Levanon E.Y., Rosenthal J.J.C., and Eisenberg E. (2017). Trade-off between Transcriptome Plasticity and Genome Evolution in Cephalopods. Cell 169, 191–202.e11. 10.1016/j.cell.2017.03.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Röhr M.E., Holmer M., Baum J.K., Björk M., Chin D., Chalifour L., Cimon S., Cusson M., Dahl M., Deyanova D., et al. (2018). Blue carbon storage capacity of temperate eelgrass (Zostera marina) meadows. Global Biogeochem. Cycles. 10.1029/2018GB005941. [DOI] [Google Scholar]
- 42.Jueterbock A., Duarte B., Coyer J., Olsen J.L., Kopp M.E.L., Smolina I., Arnaud-Haond S., Hu Z.-M., and Hoarau G. (2021). Adaptation of temperate seagrass to arctic light relies on seasonal acclimatization of carbon capture and metabolism. Front. Plant Sci. 12, 745855. 10.3389/fpls.2021.745855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.van Giesen L., Kilian P.B., Allard C.A.H., and Bellono N.W. (2020). Molecular basis of chemotactile sensation in octopus. Cell 183, 594–604.e14. 10.1016/j.cell.2020.09.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Wu L., Pan L., Wei Z., and Zhang M. (2011). Structure of MyTH4-FERM domains in myosin VIIa tail bound to cargo. Science 331, 757–760. 10.1126/science.1198848. [DOI] [PubMed] [Google Scholar]
- 45.Solé M., Lenoir M., Durfort M., López-Bejar M., Lombarte A., and André M. (2013). Ultrastructural damage of Loligo vulgaris and Illex coindetii statocysts after low frequency sound exposure. PLoS ONE 8, e78825. 10.1371/journal.pone.0078825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Zhang Y., Shi F., Song J., Zhang X., and Yu S. (2015). Hearing characteristics of cephalopods: modeling and environmental impact study. Integr. Zool. 10, 141–151. 10.1111/1749-4877.12104. [DOI] [PubMed] [Google Scholar]
- 47.Borowitzka M.A., Lavery P.S., and van Keulen M. (2006). Epiphytes of Seagrasses. In Seagrasses: Biology, Ecology and Conservation, Larkum A. W. D., Orth R. J., and Duarte C. M., eds. (Springer; Netherlands: ), pp. 441–461. 10.1007/978-1-4020-2983-7_19. [DOI] [Google Scholar]
- 48.Prazukin A.V., Lee R.I., Firsov Y.K., and Kapranov S.V. (2022). Vertical distribution of epiphytic diatoms in relation to the eelgrass Zostera noltii canopy biomass and height. Aquatic Botany 176, 103466. 10.1016/j.aquabot.2021.103466. [DOI] [Google Scholar]
- 49.Cox T.E., Cebrian J., Tabor M., West L., and Krause J.W. (2020). Do diatoms dominate benthic production in shallow systems? A case study from a mixed seagrass bed. Limnol. Oceanogr. 5, 425–434. 10.1002/lol2.10167. [DOI] [Google Scholar]
- 50.Xu C., Pi X., Huang Y., Han G., Chen X., Qin X., Huang G., Zhao S., Yang Y., Kuang T., et al. (2020). Structural basis for energy transfer in a huge diatom PSI-FCPI supercomplex. Nat. Commun. 11, 5081. 10.1038/s41467-020-18867-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Laughlin T.G., Savage D.F., and Davies K.M. (2020). Recent advances on the structure and function of NDH-1: The complex I of oxygenic photosynthesis. Biochim. Biophys. Acta Bioenerg. 1861, 148254. 10.1016/j.bbabio.2020.148254. [DOI] [PubMed] [Google Scholar]
- 52.Ma M., Liu Y., Bai C., and Yong J.W.H. (2021). The significance of chloroplast NAD(P)H dehydrogenase complex and its dependent cyclic electron transport in photosynthesis. Front. Plant Sci. 12, 661863. 10.3389/fpls.2021.661863. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Karlin S., and Altschul S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87, 2264–2268. 10.1073/pnas.87.6.2264. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Kokot M., Dehghannasiri R., Baharav T., Salzman J., and Deorowicz S. (2023). NOMAD2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads. BioRxiv. 10.1101/2023.03.17.533189. [DOI] [PubMed] [Google Scholar]
- 55.Baharav T.Z., Tse D., and Salzman J. (2023). An interpretable, finite sample valid alternative to Pearson’s X2 for scientific discovery. BioRxiv. 10.1101/2023.03.16.533008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., and Notredame C. (2017). Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319. 10.1038/nbt.3820. [DOI] [PubMed] [Google Scholar]
- 57.Salzman J., Jiang H., and Wong W.H. (2011). Statistical Modeling of RNA-Seq Data. Stat. Sci. 26. 10.1214/10-STS343. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Benjamini Y., and Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–1188. 10.1214/aos/1013699998. [DOI] [Google Scholar]
- 59.Ståhlberg A., Krzyzanowski P.M., Jackson J.B., Egyud M., Stein L., and Godfrey T.E. (2016). Simple, multiplexed, PCR-based barcoding of DNA enables sensitive mutation detection in liquid biopsies using sequencing. Nucleic Acids Res. 44, e105. 10.1093/nar/gkw224. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Motahari A., Ramchandran K., Tse D., and Ma N. (2013). Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads. In 2013 IEEE International Symposium on Information Theory (IEEE; ), pp. 1640–1644. 10.1109/ISIT.2013.6620505. [DOI] [Google Scholar]
- 61.Abante J., Wang P.L., and Salzman J. (2022). DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery. BioRxiv. 10.1101/2022.06.13.495703. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Wu C.-S., Ma Z.-Y., Zheng G.-D., Zou S.-M., Zhang X.-J., and Zhang Y.-A. (2022). Chromosome-level genome assembly of grass carp (Ctenopharyngodon idella) provides insights into its genome evolution. BMC Genomics 23, 271. 10.1186/s12864-022-08503-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Kalvari I., Nawrocki E.P., Ontiveros-Palacios N., Argasinska J., Lamkiewicz K., Marz M., Griffiths-Jones S., Toffano-Nioche C., Gautheret D., Weinberg Z., et al. (2021). Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200. 10.1093/nar/gkaa1047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Storer J., Hubley R., Rosen J., Wheeler T.J., and Smit A.F. (2021). The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA 12, 2. 10.1186/s13100-020-00230-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Ross K., Varani A.M., Snesrud E., Huang H., Alvarenga D.O., Zhang J., Wu C., McGann P., and Chandler M. (2021). Tncentral: a prokaryotic transposable element database and web portal for transposon analysis. MBio 12, e0206021. 10.1128/mBio.02060-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Leplae R., Hebrant A., Wodak S.J., and Toussaint A. (2004). ACLAME: a CLAssification of Mobile genetic Elements. Nucleic Acids Res. 32, D45–9. 10.1093/nar/gkh084. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Bi D., Xu Z., Harrison E.M., Tai C., Wei Y., He X., Jia S., Deng Z., Rajakumar K., and Ou H.-Y. (2012). ICEberg: a web-based resource for integrative and conjugative elements found in Bacteria. Nucleic Acids Res. 40, D621–6. 10.1093/nar/gkr846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Couvin D., Bernheim A., Toffano-Nioche C., Touchon M., Michalik J., Néron B., Rocha E.P.C., Vergnaud G., Gautheret D., and Pourcel C. (2018). CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 46, W246–W251. 10.1093/nar/gky425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Santamaria M., Fosso B., Licciulli F., Balech B., Larini I., Grillo G., De Caro G., Liuni S., and Pesole G. (2018). ITSoneDB: a comprehensive collection of eukaryotic ribosomal RNA Internal Transcribed Spacer 1 (ITS1) sequences. Nucleic Acids Res. 46, D127–D132. 10.1093/nar/gkx855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Selig C., Wolf M., Müller T., Dandekar T., and Schultz J. (2008). The ITS2 Database II: homology modelling RNA structure for molecular systematics. Nucleic Acids Res. 36, D377–80. 10.1093/nar/gkm827. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., and Gingeras T.R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Shen W., Le S., Li Y., and Hu F. (2016). SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11, e0163962. 10.1371/journal.pone.0163962. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Johnson L.S., Eddy S.R., and Portugaly E. (2010). Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431. 10.1186/1471-2105-11-431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Ewels P.A., Peltzer A., Fillinger S., Patel H., Alneberg J., Wilm A., Garcia M.U., Di Tommaso P., and Nahnsen S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nat. Biotechnol. 38, 276–278. 10.1038/s41587-020-0439-x. [DOI] [PubMed] [Google Scholar]
- 75.Chen S., Zhou Y., Chen Y., and Gu J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. 10.1093/bioinformatics/bty560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Albertin C.B., Simakov O., Mitros T., Wang Z.Y., Pungor J.R., Edsinger-Gonzales E., Brenner S., Ragsdale C.W., and Rokhsar D.S. (2015). The octopus genome and the evolution of cephalopod neural and morphological novelties. Nature 524, 220–224. 10.1038/nature14668. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Ma X., Olsen J.L., Reusch T.B.H., Procaccini G., Kudrna D., Williams M., Grimwood J., Rajasekar S., Jenkins J., Schmutz J., et al. (2021). Improved chromosome-level genome assembly and annotation of the seagrass, Zostera marina (eelgrass). F1000Res. 10, 289. 10.12688/f1000research.38156.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Olsen J.L., Rouzé P., Verhelst B., Lin Y.-C., Bayer T., Collen J., Dattolo E., De Paoli E., Dittami S., Maumus F., et al. (2016). The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. Nature 530, 331–335. 10.1038/nature16548. [DOI] [PubMed] [Google Scholar]
- 79.Medhekar B., and Miller J.F. (2007). Diversity-generating retroelements. Curr. Opin. Microbiol. 10, 388–395. 10.1016/j.mib.2007.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Agresti A. (1992). A Survey of Exact Inference for Contingency Tables. Stat. Sci. 7, 131–153. 10.1214/ss/1177011454. [DOI] [Google Scholar]
- 81.Fisher R.A. (1922). On the Interpretation of X2 from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85, 87. 10.2307/2340521. [DOI] [Google Scholar]
- 82.Diaconis P., and Sturmfels B. (1998). Algebraic algorithms for sampling from conditional distributions. Ann. Statist. 26. 10.1214/aos/1030563990. [DOI] [Google Scholar]
- 83.Chen Y., Diaconis P., Holmes S.P., and Liu J.S. (2005). Sequential monte carlo methods for statistical analysis of tables. J. Am. Stat. Assoc. 100, 109–120. 10.1198/016214504000001303. [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The code used in this work is available as a fully-containerized Nextflow pipeline56 at https://github.com/salzmanlab/nomad, commit 1b73949.
The human lung scRNA-seq data used here is accessible through the European Genome-phenome Archive (accession number: EGAS00001004344); FASTQ files from donor 1 and donor 2 were used. The FASTQ files for the Tabula Sapiens data (Smart-seq2) were downloaded from https://tabula-sapiens-portal.ds.czbiohub.org/; B cells were used from donor 1 (TSP1) and T cells from donor 2 (TSP2). The mouse lemur single-cell RNA-seq data used in this study was generated as part of the Tabula Microcebus consortium; the FASTQ files were downloaded from https://tabula-microcebus.ds.czbiohub.org. Viral data was downloaded from the NCBI: SARS-CoV-2 from France (SRP365166), SARS-CoV-2 from South Africa (SRP348159), and rotavirus (SRP328899).
The sample sheets used as pipeline input, including individual sample SRA accession numbers, for all analyses are uploaded to pipeline GitHub repository. Similarly, scripts to perform supplemental analysis can be found on the pipeline repository.
FASTP v0.23.2 was installed on 2/15/23 using bioconda. R-SPLASH v0.3.9 was installed on 2/23/23, commit 5dafdc8.