mSphere. 2024 Jun 21;9(7):e00139-24. doi: 10.1128/msphere.00139-24

Synthetic lethality and the minimal genome size problem

Sara Rahiminejad 1, Bianca De Sanctis 2,3, Pavel Pevzner 4, Arcady Mushegian 5,6
Editor: Geraldine Butler 7
PMCID: PMC11288024  PMID: 38904396

ABSTRACT

Gene knockout studies suggest that ~300 genes in a bacterial genome and ~1,100 genes in a yeast genome cannot be deleted without loss of viability. These single-gene knockout experiments do not account for negative genetic interactions, when two or more genes can each be deleted without effect, but their joint deletion is lethal. Thus, large-scale single-gene deletion studies underestimate the size of a minimal gene set compatible with cell survival. In yeast Saccharomyces cerevisiae, the viability of all possible deletions of gene pairs (2-tuples), and of some deletions of gene triplets (3-tuples), has been experimentally tested. To estimate the size of a yeast minimal genome from that data, we first established that finding the size of a minimal gene set is equivalent to finding the minimum vertex cover in the lethality (hyper)graph, where the vertices are genes and (hyper)edges connect k-tuples of genes whose joint deletion is lethal. Using the Lovász-Johnson-Chvatal greedy approximation algorithm, we computed the minimum vertex cover of the synthetic-lethal 2-tuples graph to be 1,723 genes. We next simulated the genetic interactions in 3-tuples, extrapolating from the existing triplet sample, and again estimated minimum vertex covers. The size of a minimal gene set in yeast rapidly approaches the size of the entire genome even when considering only synthetic lethalities in k-tuples with small k. In contrast, several studies reported successful experimental reductions of yeast and bacterial genomes by simultaneous deletions of hundreds of genes, without eliciting synthetic lethality. We discuss possible reasons for this apparent contradiction.

IMPORTANCE

How can we estimate the smallest number of genes sufficient for a unicellular organism to survive on a rich medium? One approach is to remove genes one at a time and count how many of the resulting deletion strains are unable to grow. However, the single-gene knockout data are insufficient, because joint gene deletions may result in negative genetic interactions, also known as synthetic lethality. We used a technique from graph theory to estimate the size of a minimal yeast genome from partial data on synthetic lethality. The number of potential synthetic lethal interactions grows very fast when multiple genes are deleted, revealing a paradoxical contrast with the experimental reductions of the yeast genome by ~100 genes, and of bacterial genomes by several hundreds of genes.

KEYWORDS: synthetic lethality, minimal gene set, minimal genome size, yeast, bacteria

INTRODUCTION

A minimal gene set, or minimal genome, is “the smallest possible group of genes that would be sufficient to sustain a functioning cellular life form under the most favorable conditions imaginable” (1). In principle, an organism could have several minimal genomes consisting of different gene sets, each with the same total number of genes. The minimal gene set for the same genome may also differ depending on the environmental variables, such as the composition of the growth medium. Minimal genomes are of interest because their content may provide clues to the fundamental engineering principles enabling cellular life and may inform the efforts to generate fully controllable chassis for biotechnological manufacturing of molecules (1, 2).

When defining a minimal gene set for a given species under specific conditions, we may be interested in the identity of all genes within a minimal gene set, or else, we may want to know just the size of the set, i.e., the number of elements in it. If we had a complete knowledge of every function of each gene, along with the understanding of how these genes and their products are organized to form a network that relays signals and processes metabolites, we could construct a detailed model of cellular metabolism, self-preservation, division, and reproduction. A minimal genome could then be determined by simplifying this model as much as possible, while ensuring that it remains capable of sustaining cell viability; this approach could give both the identity and the number of genes in the minimal genome. Even in the best-studied organisms, however, the metabolic and signaling networks are known only partially, the knowledge of the molecular functions of any gene is generally incomplete and the extent of gene pleiotropy is unknown, and molecular functions of many genes are not known at all. Therefore, the estimates of the minimal genome content and size have to rely on heuristic approaches. In this work, we are concerned with the estimation of the size of the set, rather than the identity or function of each gene within it. In the following, we measure the minimal genome sizes in the number of genes, and not, for example, in the sum of base pairs in all genes.

One class of computational heuristics to determine the minimal genome size involves comparisons of gene repertoires in completely sequenced genomes. The main idea is that the orthologous genes are shared by genomes because these genes are important for the organism survival, and that with more genomes and with larger evolutionary distances between them, the shared gene set would approximate a minimal essential genome. In 1996, the gene complements of the two bacterial genomes that were the only genomes completely sequenced at the time, Gram-negative Haemophilus influenzae (~1,700 genes) and Gram-positive-related Mycoplasma genitalium (~470 genes), were compared (3). The orthologs were defined using an approach that later evolved into the well-known bidirectional best match analysis (for the review of modern methodology, see reference 4). There were 244 orthologous genes, and after supplementing them with a small number of clearly isofunctional genes that were not orthologous in two species, the first theoretical minimal genome of 256 genes was constructed (3, 5).

The methods to compute the shared gene sets were developed further (1, 2, 6–11), and their application to genome comparison showed that the shared set of orthologs rapidly decays with the addition of more species, especially when considering very large evolutionary distances, such as those that separate the main lineages of Bacteria and Archaea. Ultimately, only between 30 and 40 genes have orthologs in all cellular species (12). The genes in this small set belong to just a few functional modules, such as ribosome components, machinery for RNA synthesis, and mRNA translation, and are not expected to encode a self-sufficient cell. Other genes are needed to support the biosynthetic capacity of a cell and its interactions with the environment; genes in these categories are often not orthologous in different species, as most of the cell functions have been invented more than once in evolution (12, 13). Thus, counting only shared orthologs will tend to underestimate the true size of a minimal genome, perhaps severely so.

Experimental approaches to the definition of a minimal genome are based on the analysis of gene knockouts. In a pioneering work, M. Itaya mutagenized 79 random chromosomal loci in Bacillus subtilis to show that 6 mutants could no longer form colonies on the solid medium, whereas 73 could grow unimpaired (14). The genome of B. subtilis remained unsequenced at the time; assuming that it consisted of ~4,000,000 base pairs, the minimal genome size was estimated to be 300–500 genes. More recently, many genomes have been completely sequenced, and large-scale genetic manipulations became possible in a wider variety of organisms. A common strategy to establish a set of essential genes nowadays is to conduct single-gene knockout experiments on the entire genome, one gene at a time. Reviews of the studies using this strategy can be found in references 15–23.

However, single-gene knockout experiments underestimate the size of a minimal genome as well. Here, the main reason is that single gene deletions do not account for genetic interactions such as synthetic lethalities, in which the simultaneous deletion of two non-essential genes results in a loss of viability, whereas their separate deletion by definition does not. This type of genetic interaction is common when two genes are functionally redundant, for example when they were produced in a recent duplication event, or when the two genes belong to two different pathways involved in production of the same metabolite. In genetic literature, negative gene interactions are also known as negative epistasis, or, in the context of evolution of reproductive isolation, as Dobzhansky-Muller incompatibility (24–29). The notion of negative interaction may be generalized to three, four, or more genes, if their joint deletion or mutation is lethal but deletion/mutation of any subset of those genes is not.

The most detailed data for systematic deletions of groups of genes have been obtained for yeast: the viability of all possible single and double-gene deletions of ~6,200 protein-coding genes in yeast S. cerevisiae has been tested (18, 30–33), and a sample of triple gene deletions has also been produced and tested (34, 35). Here, we use these single, double, and triple gene deletion data to establish an estimate of the minimal genome size for yeast, by re-framing the problem in the context of graph theory.

MATERIALS AND METHODS

Background and notation

A graph is a collection of objects such that some pairs of those objects have a special significance, e.g., they are said to be “related” or “connected.” Formally, a graph G is defined by two sets (V, E), where V is a set of elements called vertices, and E is a set of pairs of vertices, called edges.

A map of the metabolic and signaling networks in a cell is a type of graph familiar to biologists; it can be defined so that the vertices represent metabolites or signaling molecules, and edges incident to a vertex are genes whose protein products make, break, or interact with this molecule. Call such a graph Gm. In Fig. 1A, an imaginary Gm is shown, in which the unlabeled vertices represent different metabolites, and edges are chemical conversions catalyzed by gene products, g1 through g6. (Here and below, we talk about metabolites and enzymatic conversions for concreteness, but other types of biological functions, for example signaling pathways and signal-relaying molecules, can be represented in the same framework rather easily.) The sequence of edges that leads from one vertex to another, possibly passing through several vertices, is called a path. For some pairs of metabolites, there may be several paths connecting them. Note that two paths in Fig. 1A, one consisting of g5 and g6, and the other consisting of g2, g3, and g4, perform the same biochemical conversion, so either one of those two pathways may be sufficient for cell survival.

Fig 1. Relationship between two types of gene graphs. (A) A metabolic map / metabolic graph Gm. (B) The gene lethality graph Gl corresponding to the metabolic graph Gm. The minimum vertex cover of the graph consists of nodes g5 and g6, and the minimal genome includes g1, g5, and g6 (indicated by green).

Consider now a lethality graph Gl, in which the vertices are genes, and the edges are pairs of genes whose joint deletion is lethal. Such a graph, representing the relationships of genes in Gm, is shown in Fig. 1B. Here, labeled vertices represent genes, and each edge represents a lethal phenotype when the two connected genes are deleted simultaneously (g1 is self-connected to indicate that its single deletion is lethal).

Let us denote a minimal gene set as M. If a single-gene deletion is lethal, the gene is part of a minimal genome by definition. We call the set of all such genes M1 ⊆ M, i.e., the component of a minimal genome identified by the deletions of single genes (1-tuples of genes). The size of M1 gives a lower bound on the size of M. Similarly, we can define M2 ⊆ M as the component of M identified by pairwise interactions or 2-tuples, M3 ⊆ M as the component of M identified by 3-tuples, and more generally Mk ⊆ M as the component of M identified by k-tuples, so that ∪ Mk = M. If k > 2, then the mathematical object under consideration is a hypergraph, a generalization of a graph in which each hyperedge connects more than 2 vertices at once.

Importantly, knowing all double deletions, or 2-tuples, that result in synthetic lethality does not immediately give us M2, and it has to be computed from the data, as we explain below. The same is true for lethal deletions of k-tuples for k > 2. Note also that systematic deletion of all possible triples, quadruples, etc., of genes is physically prohibitive with the present-day technology, because of the very large number of possible k-tuples. A rather large subset of possible triple deletions in yeast, however, has been screened for viability (34, 35). Sampling of the k-tuples for k > 3 has not yet been reported for any organism.

The data

For conciseness, we will call a k-tuple of genes “lethal” if the deletion/absence of that k-tuple results in a lethal phenotype. The S. cerevisiae genome contains ~6,200 genes, including ~1,000–1,700 essential genes and ~4,500–5,200 non-essential genes for k = 1. The range of numbers reflects the differences in the experimental conditions between the genome-wide screens, including modifications in the gene definition algorithms (i.e., what counts as a gene), different experimental protocols for making gene knockouts, and the details of phenotype scoring methods, such as sensitivity to the threshold at which the continuous fitness variable is discretized into the viable vs. non-viable phenotypes (26, 30, 31, 33, 35–37).

A binary array of pairwise combinations of genes that are not lethal when deleted individually, generated in connection with the studies (32, 33), was provided to us by A. Baryshnikova (Calico Life Sciences) and is described in more detail in a previous work (38). The rows and columns in the array/matrix represent the data on 4,457 non-essential yeast genes, and the matrix elements represent the binary outcomes (lethal or not) of double deletion of each gene pair. There are 9,930,196 possible pairwise combinations of 4,457 genes. Of these, 318,757 gene pairs, or 3.2% of the total, have been scored as lethal. We encoded information about these pairwise combinations in a graph with 4,457 vertices and 318,757 edges, which we call YEAST-K2. Each vertex in this graph represents a gene, and each edge connects two genes which cause lethality when jointly deleted. The matrix and the YEAST-K2 graph are available at https://github.com/srahimia/minimal_vc.
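As a quick check of the counts quoted for YEAST-K2, the following sketch recomputes the number of possible gene pairs, the lethal fraction, and the mean vertex degree from the two figures given in the text (4,457 genes and 318,757 lethal pairs). The variable names are illustrative and are not taken from the authors' repository.

```python
from math import comb

N_GENES = 4457        # non-essential genes in the array (from the text)
N_LETHAL = 318_757    # gene pairs scored as lethal (from the text)

n_pairs = comb(N_GENES, 2)              # all unordered pairs of distinct genes
lethal_fraction = N_LETHAL / n_pairs    # share of pairs scored as lethal
mean_degree = 2 * N_LETHAL / N_GENES    # every edge adds 1 to two vertex degrees

print(n_pairs)                    # 9930196
print(f"{lethal_fraction:.1%}")   # 3.2%
print(f"{mean_degree:.0f}")       # 143
```

The mean degree obtained this way agrees with the average degree of 143 reported for YEAST-K2 in Results and Discussion.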

Greedy algorithm for finding minimum vertex covers

A hypergraph is the generalization of a graph in which edges can connect more than two vertices. Analogously to a graph, hypergraph H(V,E) is defined by its vertex set V and hyperedge set E. A subset of vertices in a hypergraph H is called a vertex cover if each hyperedge in the hypergraph contains at least one vertex from this vertex subset. A minimum vertex cover of a hypergraph is its vertex cover of minimal size.

In the context of gene deletion analysis, if the set of all lethal k-tuples of genes is known, a minimal genome must include at least one gene from each k-tuple, or at least one vertex in each (hyper)edge. Thus, given vertices and edges of the lethality graph, establishing a minimal genome is equivalent to finding a minimum vertex cover, or a minimum subset of vertices that includes at least one vertex from every (hyper)edge.
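To make the equivalence concrete, here is a minimal sketch that finds a minimum vertex cover by brute force on a toy lethality graph. The edge list is an assumption read off the description of Fig. 1 (g1 is lethal on its own, and deleting one gene from the g5/g6 pathway together with one gene from the g2/g3/g4 pathway is lethal); the function names are hypothetical. Exhaustive search is only feasible for a handful of genes, which is why the greedy approximation described next is used for real data.

```python
from itertools import combinations

# Toy lethality graph (assumed from the Fig. 1 description): a singleton
# edge marks a gene whose single deletion is lethal.
EDGES = [frozenset({"g1"})] + [
    frozenset({a, b}) for a in ("g5", "g6") for b in ("g2", "g3", "g4")
]

def is_vertex_cover(cover, edges):
    """True if every (hyper)edge contains at least one vertex of `cover`."""
    cover = set(cover)
    return all(edge & cover for edge in edges)

def minimum_vertex_cover(edges):
    """Exhaustive search over subsets of increasing size (toy inputs only)."""
    vertices = sorted(set().union(*edges))
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(candidate, edges):
                return set(candidate)
    return set(vertices)

print(sorted(minimum_vertex_cover(EDGES)))   # ['g1', 'g5', 'g6']
```

The result matches the minimal genome named in the Fig. 1 caption: g1 plus the two-gene cover g5 and g6.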

Finding the minimum vertex cover of a (hyper)graph is an NP-hard problem (39). However, a minimum vertex cover can be estimated with a greedy approximation algorithm, the Lovász-Johnson-Chvatal algorithm, LJC (40). The degree of a vertex in a hypergraph is the number of hyperedges to which that vertex belongs. At each step, the LJC algorithm (i) selects a vertex of maximal degree, (ii) removes all hyperedges incident to the selected vertex from the hyperedge set, (iii) updates the degrees of neighbors of the removed vertex (i.e., the vertices which shared hyperedges with the removed vertex), and (iv) adds the removed vertex to the vertex cover. The LJC algorithm runs iteratively until the hyperedge set is empty. The procedure generates an upper bound on the size of a minimum vertex cover of a hypergraph. The following pseudocode describes the LJC algorithm for a hypergraph H with vertex set V and hyperedge set E:

The Lovász–Johnson–Chvatal approximation algorithm LJC(H) for finding a minimum vertex cover in a hypergraph.

    vertex cover ← empty set
    while (hyperedge set of the hypergraph is non-empty)
        v ← any vertex with maximal degree in the hypergraph
        remove all hyperedges incident to v from the hyperedge set
        remove v from the vertex-set of the hypergraph
        add v to vertex cover
    return vertex cover
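The pseudocode can be sketched in Python as follows. This is an illustrative reading of the greedy procedure, not the authors' implementation (theirs is in the repository cited under Data Availability). Hyperedges are represented as sets of vertex names, and ties between vertices of equal degree are broken by name, which is an arbitrary choice made here.

```python
def ljc_naive(hyperedges):
    """Greedy vertex cover: repeatedly take a vertex of maximal degree."""
    remaining = [frozenset(e) for e in hyperedges if e]   # skip empty edges
    cover = []
    while remaining:
        # Recount all degrees from scratch on every pass (the naive part).
        degree = {}
        for edge in remaining:
            for v in edge:
                degree[v] = degree.get(v, 0) + 1
        # max() keeps the first maximal item, so ties go to the smallest name.
        best = max(sorted(degree), key=degree.get)
        cover.append(best)
        # Drop every hyperedge that the chosen vertex covers.
        remaining = [edge for edge in remaining if best not in edge]
    return cover

toy = [{"g1"}] + [{a, b} for a in ("g5", "g6") for b in ("g2", "g3", "g4")]
print(ljc_naive(toy))   # ['g5', 'g6', 'g1']
```

On this toy input the greedy cover happens to be optimal; in general the procedure only guarantees an upper bound on the minimum cover size.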

Simulations

All lethal gene pairs are known empirically, but only a sample of lethal triples is known (31–35). In order to apply the algorithm to the triplet data, hypergraphs were generated by extrapolating from existing data via simulation, in three different ways. In the first approach, hyperedges were assigned to each triple with a uniform probability $1/|V|^3$, where |V| is the number of vertices in the hypergraph (Method 1, or “uniform sampling”). In Method 2 (“weighted sampling”), a probability distribution P = p1, p2, p3, ... for all vertices in the YEAST-K2 graph was generated first, where pi is the ratio of the degree of vertex i to the sum of degrees of all vertices in the graph. Then, a triple (u, v, w) was selected from P. The number of hyperedges (triples) added to the hypergraph was determined by the product $fT$, where T is the number of all possible triples in the hypergraph (nearly $1.5 \times 10^{10}$ triples for 4,457 vertices), and the values of f were selected to vary between $10^{-7}$ and $10^{-4}$. Method 3 (“targeted sampling”) is similar to Method 2, but only those vertices are sampled which remain after M2 is removed.
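The sampling schemes can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' code: the text does not say how repeated vertices or duplicate triples were handled, so this sketch rejects both, and the function names and the toy degree table are hypothetical. The argument `n_triples` plays the role of the product fT.

```python
import random
from math import comb

def sample_triples_uniform(vertices, n_triples, rng):
    """Method 1 sketch: every triple of distinct vertices is equally likely."""
    vertices = list(vertices)
    if n_triples > comb(len(vertices), 3):
        raise ValueError("more triples requested than exist")
    triples = set()
    while len(triples) < n_triples:
        triples.add(frozenset(rng.sample(vertices, 3)))
    return triples

def sample_triples_weighted(degree, n_triples, rng):
    """Method 2 sketch: vertices are drawn in proportion to their degree."""
    vertices = [v for v, d in degree.items() if d > 0]
    weights = [degree[v] for v in vertices]
    if n_triples > comb(len(vertices), 3):
        raise ValueError("more triples requested than exist")
    triples = set()
    while len(triples) < n_triples:
        triple = frozenset(rng.choices(vertices, weights=weights, k=3))
        if len(triple) == 3:          # redraw if a vertex was repeated
            triples.add(triple)
    return triples

rng = random.Random(0)                # fixed seed for reproducibility
toy_degree = {"a": 5, "b": 3, "c": 2, "d": 1, "e": 1}
print(len(sample_triples_uniform(toy_degree, 4, rng)))
print(len(sample_triples_weighted(toy_degree, 4, rng)))
```

Under the same assumptions, Method 3 would amount to calling `sample_triples_weighted` with a degree table restricted to the vertices that are not in M2.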

RESULTS AND DISCUSSION

We compiled the genome-scale gene deletion data at k = 1 from the studies in bacteria, archaea, and unicellular fungi. The results of 41 studies, involving microorganisms with vastly different biology and more than 10-fold difference in genome sizes, are summarized in Fig. 2; Table S1; Fig. S1. The main observation is that when genes are deleted one at a time, the essential gene sets tend to consist of 300–400 genes in bacteria and archaea (Fig. 2A). This is within the bounds given by Itaya’s early extrapolation from a sample of gene disruptions (14), and in the ballpark of an early computational estimate (3). Four fungal genomes (S. cerevisiae, Sch. pombe, Pichia pastoris, and Candida albicans; top right corner in Fig. 2A) are the outliers—their essential gene complement is ~1,000–1,200 genes. The fact that fungi have many more essential genes than prokaryotes does not come as a surprise; it must reflect more complex cellular organization and life cycle in eukaryotes compared to bacteria and archaea. It must be noted also that the screens were performed mostly on the nutrient-rich media and that the mutant collection of only one bacterium (Shewanella) was profiled under multiple conditions (Table S1); these variations are not addressed here.

Fig 2. The number (A) and proportion (B) of essential genes in unicellular organisms, determined by serial inactivation of single genes. Compiled from the data collected in Table S1. The same plots are provided as Fig. S1, with each species labeled individually.

The trend in the percentages of essential genes is different (Fig. 2B). The percentage is inversely proportional to the genome size and is especially high in two species of mollicutes with very small genomes (Mycoplasma genitalium and Mycoplasma bovis in the top left corner, Fig. 2B), even as the absolute numbers of those genes are similar to what is found in other bacteria.

We are interested in the ways to determine a minimal genome size by simultaneous gene deletions. To study this problem, it would be sensible to experiment with the already-small bacterial or archaeal genomes. Perhaps surprisingly, the data for systematic deletions of k-tuples of genes are not generally available for prokaryotes, but they have been obtained for the larger genome of yeast S. cerevisiae. The number of k-tuples for a given value of k in a species with N genes is given by the binomial coefficient $\binom{N}{k}$, and a summation of the binomial coefficients over the values of k gives the total number of all possible k-tuples from 1 to k. Table 1 shows the values for the genomes with 500 genes (a small bacterial genome) and with 6,000 genes (a larger genome typical of yeasts and related fungi). The number of possible k-tuples grows rapidly with the increase of k and the increase of N, making the exhaustive examination of all k-tuple deletions plausible only for the small values of k.

TABLE 1. The number of k-tuples for some values of k and N genes

             k = 1    k = 2         k = 3
N = 500      500      124,750       20,708,500
N = 6,000    6,000    17,997,000    35,982,002,000
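The k-tuple counts are binomial coefficients and can be recomputed with Python's standard library. The helper name `tuple_counts` is hypothetical and used only for this illustration.

```python
from math import comb

def tuple_counts(n_genes, max_k):
    """Number of possible k-tuples of genes for k = 1 .. max_k."""
    return [comb(n_genes, k) for k in range(1, max_k + 1)]

for n_genes in (500, 6000):
    counts = tuple_counts(n_genes, 3)
    # Per-k counts, followed by the running total over k = 1..3
    print(n_genes, counts, sum(counts))
```

Because `comb` works with exact integers, the same helper can be used for larger k without rounding error, although the counts quickly become too large for exhaustive screening.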

We model the lethal interactions data by representing the union of all lethal genes, lethal gene pairs, and lethal gene triples as a lethality hypergraph H(V,E). The vertex set of this hypergraph corresponds to all genes in a genome, in which each single gene v that is lethal when deleted corresponds to a self-loop at vertex v, each pair (v,w) that is lethal when jointly deleted defines an edge connecting vertices v and w, and each triple (v, w, u) lethal when jointly deleted defines a hyperedge formed by vertices v, w, and u. Finding a minimal gene set increment for a given value of k is equivalent to finding a minimum vertex cover in the hypergraph. For k = 1, a minimal gene set increment M1 is given directly by the genes which are lethal when deleted singly; for k = 2 and k = 3, the increments M2 and M3 must be derived from the synthetic lethality data. The size of a minimal gene set is obtained by summation of all Mi.

The distribution of vertices by the number of adjacent edges in YEAST-K2 (4,457 vertices, 318,757 edges) is fat-tailed, i.e., a small fraction of vertices has very large degrees and a large fraction of vertices has small degrees, with the average degree of 143 and the median degree of 90. To obtain the estimate of the minimal genome size from the YEAST-K2 graph, we modified the LJC algorithm. The complexity of the naïve implementation of the LJC algorithm on the hypergraph H(V,E) is O(|V| × |E|), where |V| is the total number of vertices and |E| is the sum of sizes of all hyperedges in the hypergraph (the size of the hyperedge is the number of vertices it contains).

We modified the algorithm so that instead of revisiting all vertices, it updates and records the degrees of vertices at each iteration. Although this approach needs more memory, its worst-case complexity is O(|V|+|E|). The pseudocode of the improved implementation is presented below.

A faster version of the Lovász–Johnson–Chvatal algorithm, LJC*

    vertex cover ← empty set
    degree ← an array of degrees of vertices in the hypergraph H
    while (hyperedge set of the hypergraph is non-empty)
        v ← a vertex with the maximal degree in the hypergraph
        remove v from the vertex set
        for each hyperedge e incident to v
            remove e from the hyperedge set
            for each vertex w from e
                degree(w) ← degree(w)−1
        add v to vertex cover
    return vertex cover
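One way to realize the incremental degree updates in Python is sketched below. It is not the authors' implementation: to pick the maximal-degree vertex without rescanning, this sketch uses a binary heap with lazy deletion of outdated entries, which is an assumption made here and costs O(|E| log |V|) rather than the linear bound stated in the text. The name `ljc_fast` is hypothetical.

```python
import heapq

def ljc_fast(hyperedges):
    """Greedy vertex cover with degrees updated in place."""
    edges = [frozenset(e) for e in hyperedges if e]
    incident = {}                        # vertex -> indices of its hyperedges
    for i, edge in enumerate(edges):
        for v in edge:
            incident.setdefault(v, []).append(i)
    degree = {v: len(ids) for v, ids in incident.items()}
    alive = [True] * len(edges)          # hyperedges not yet covered
    n_alive = len(edges)
    # Max-heap through negated degrees; ties go to the smallest vertex name.
    heap = [(-d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    cover = []
    while n_alive:
        neg_d, v = heapq.heappop(heap)
        if -neg_d != degree[v]:
            continue                     # outdated entry, skip it
        cover.append(v)
        for i in incident[v]:
            if not alive[i]:
                continue
            alive[i] = False
            n_alive -= 1
            for w in edges[i]:
                degree[w] -= 1           # only neighbours of v are touched
                if w != v and degree[w] > 0:
                    heapq.heappush(heap, (-degree[w], w))
    return cover

toy = [{"g1"}] + [{a, b} for a in ("g5", "g6") for b in ("g2", "g3", "g4")]
print(ljc_fast(toy))   # ['g5', 'g6', 'g1']
```

A bucket queue indexed by degree would be an alternative that avoids the logarithmic factor, at the price of somewhat more bookkeeping.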

We tested the running times of the naïve and modified algorithms when finding the minimum vertex cover of YEAST-K2. On a 1.8 GHz Intel Core i5 laptop, LJC(H) takes ~224 s to find the minimum vertex cover, whereas LJC* takes 6 s. In 50 independent runs, LJC* produced vertex covers M2 of YEAST-K2 with sizes 1,722 or 1,723. When added to M1 of ~1,350 [a mid-range value from the results in references (32–34)], the total number of genes in the minimal genome approaches 3,100.

We next simulated 100 graphs with the same number of vertices and edges as the YEAST-K2 graph and with a similar distribution of node degree probabilities. The probability p(v) of a vertex v in the YEAST-K2 graph is defined as the ratio of its degree to the sum of degrees of all vertices in this graph. For vertices vi and vj, an edge between them was assigned with probability p(vi, vj) = p(vi)p(vj), and |E| random edges were sampled from the distribution. Table 2 shows the distribution of sizes of vertex covers for the set of 100 simulated graphs. As can be seen, the minimal genome increments simulated in this way are at least twice as large as M2.
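The edge-sampling step can be illustrated with a short sketch. It assumes that the two endpoints of an edge are drawn independently in proportion to vertex degree and that self-pairs and duplicate edges are redrawn; the text does not state how those cases were treated, and `simulate_graph` is a hypothetical name.

```python
import random
from math import comb

def simulate_graph(degree, n_edges, rng):
    """Draw n_edges distinct edges with endpoint probabilities p(v) ~ degree(v)."""
    vertices = [v for v, d in degree.items() if d > 0]
    weights = [degree[v] for v in vertices]
    if n_edges > comb(len(vertices), 2):
        raise ValueError("more edges requested than exist")
    edges = set()
    while len(edges) < n_edges:
        u, w = rng.choices(vertices, weights=weights, k=2)
        if u != w:                       # no self-pairs in the simulated graph
            edges.add(frozenset((u, w)))
    return edges

rng = random.Random(0)                   # fixed seed for reproducibility
toy_degree = {"a": 4, "b": 3, "c": 2, "d": 1}
print(len(simulate_graph(toy_degree, 5, rng)))   # 5
```

Each simulated edge set can then be passed to a greedy vertex cover routine to obtain one entry of a distribution such as the one in Table 2.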

TABLE 2. The ranges of genome sizes obtained in 100 simulations

Genome sizes    Outcomes    Genome sizes    Outcomes
3,715–3,720     2           3,741–3,745     17
3,721–3,725     3           3,746–3,750     16
3,726–3,730     10          3,751–3,755     11
3,731–3,735     15          3,756–3,760     4
3,736–3,740     19          3,761–3,765     3

At the next step, we augmented YEAST-K2 with the three-vertex hyperedges, generated as described in Materials and Methods. The YEAST-K3 hypergraphs were produced and their minimum vertex covers were computed using the LJC* algorithm. The size of minimal genome increment M3 was found to grow rapidly at first with the increase of f, but to slow down around the f value of $4 \times 10^{-5}$, which corresponds to approximately twofold excess of triples over pairs (Fig. 3A). The zone of rapid growth in the minimal genome size (f between $10^{-7}$ and $10^{-5}$, corresponding to, respectively, 1,472 and 148,737 triples added to 318,757 pairs) was then studied at a higher resolution. We simulated lethal triples using three approaches; in one of them, each lethal triple was selected with a fixed probability $1/|V|^3$, where |V| is the number of vertices in the graph. In the other two approaches, the lethal triples were simulated by sampling from a probabilistic distribution derived from lethal pairs. We applied the LJC* algorithm and summed M1, M2, and M3. Furthermore, we simulated 108,753 random triples to estimate the size of the minimal genome, using an estimated ratio of the lethal triples to lethal pairs, r, of 0.34 (based on the data from reference 34, where 410,399 pairs and 195,666 triples of yeast genes have been deleted, and 9,363 pairs and 3,196 triples were lethal). The results of the simulations are shown in Fig. 3.
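The quoted ratio can be traced with a few lines of arithmetic. The reading used here, that r is the number of lethal triples divided by the number of lethal pairs in the sample, is an assumption; it is the interpretation that reproduces the value 0.34 from the four counts given in the text.

```python
pairs_tested, pairs_lethal = 410_399, 9_363        # sample counts from the text
triples_tested, triples_lethal = 195_666, 3_196    # sample counts from the text

r = triples_lethal / pairs_lethal                  # assumed definition of r
pair_rate = pairs_lethal / pairs_tested            # lethal share of tested pairs
triple_rate = triples_lethal / triples_tested      # lethal share of tested triples

print(f"r = {r:.2f}")                              # r = 0.34
print(f"{pair_rate:.2%} of pairs, {triple_rate:.2%} of triples")

# The simulated triple count relative to the YEAST-K2 pairs gives the same ratio
print(f"{108_753 / 318_757:.2f}")                  # 0.34
```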

Fig 3. Minimal genome size estimates based on the known YEAST-K2 graph combined with the simulated YEAST-K3 hypergraph. (A) Uniform sampling method. (B) The range of lower values of triples-to-pairs ratio from panel A.

It appears that when k is incremented by only a unit, the synthetic lethality effects increase the size of a minimal yeast genome by hundreds of genes. We expect that with the further increase in k, the sum of Mk will rapidly approach the size of the entire yeast genome. Similar estimates have been described in (34), where the data on k = 2 were used to parametrize a model, the lethal k-tuples were simulated for k from 0 to 10, and an unspecified greedy algorithm was applied to infer the minimal gene set. That study, similarly to ours, predicts that the size of the minimal genome should rapidly approach the total number of genes when the higher-order interactions are considered.

These estimates appear to be at odds with the results of the knowledge-based genome reduction experiments published for several species. In S. cerevisiae, elimination of one particular set of ~200 specific genes, i.e., ~5% of the genome, has resulted in a viable strain (41). In fission yeast Sch. pombe, a similar fraction of all genes has been removed without loss of viability (42). Large bacterial genomes of Escherichia coli and B. subtilis can survive after removal of specific sets of genes representing ~22% and ~41% of all genes in the genome, respectively (43, 44), and a derivative of a small bacterial genome of Mycoplasma mycoides can grow on the rich liquid medium when retaining only about one-half of all genes, i.e., 424 out of 901 (45). It is also quite likely that additional genes can be deleted from each of those genome-reduced strains, and it is also possible that not only one specific gene cohort, but some other gene sets of similar sizes may be deleted in each of those species without loss of viability.

Thus, the calculations reported in reference 34 and in this study are in contrast with the successful genome reduction experiments in multiple species. Indeed, for a set of 200 yeast genes, a summation of k-tuples for k ≤ 10 gives about $2 \times 10^{16}$ tuples, and if the reported percentages of lethal k-tuples for k > 3 are similar to what is known for k = 2 and k = 3, one should expect an arbitrary group of 200 genes to include an astronomical number of lethal gene combinations. A question then arises: how could the genetic interaction landscape be navigated to allow for a significant genome reduction in the presence of pervasive negative genetic interactions?
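The order of magnitude quoted for a set of 200 genes can be checked by summing binomial coefficients. This is a plain arithmetic check; the variable names are illustrative.

```python
from math import comb

N_DELETED = 200   # size of the deleted gene set considered in the text
MAX_K = 10        # largest tuple size included in the sum

# Number of distinct k-tuples that can be formed within the 200 genes
total = sum(comb(N_DELETED, k) for k in range(1, MAX_K + 1))

print(f"{total:.1e}")                     # on the order of 2 x 10^16
print(f"{comb(N_DELETED, MAX_K):.1e}")    # the k = 10 term dominates the sum
```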

One answer to this is that the hundreds of genes removed in the genome reduction studies described above have not been selected arbitrarily or randomly; in all those cases, the prior knowledge, or homology-based inference, of the metabolic pathways in the target species was used to identify the genes that are least likely to impact species’ survival on the rich medium (see references 41–45 for more explanation).

It is also of interest to know whether the model of genetic interactions that we used here can be improved. For example, it is likely that some of the relevant aspects of the genetic interaction at k = 2 are not captured by our naïve model based on the global statistical properties of YEAST-K2 and simulation of YEAST-K3. To get closer to reality, one must improve the framework described here in order to allow for more realistic models of genetic interaction.

One factor to consider may be positive genetic interactions, i.e., the cases when the fitness of a multiply deleted strain is not lower, but higher than the expectation based on the phenotypes of the single deletions. Considering the case of two genes, a positive interaction between the pair of deletions means that if one or both genes are essential at k = 1, they nevertheless may be jointly eliminated from M1; this can be generalized for the higher values of k. Positive interactions between gene deletions are at most half as common as negative interactions (18, 32), yet including them in the system may influence the outcome.

All told, the two main conclusions from this work are as follows. First, we show that the size of the minimal gene set increment for a given k is obtained by finding the minimum vertex cover in the synthetic lethality hypergraph, in which the vertices are genes and the hyperedges are the k-tuples of synthetic lethal gene interactions. Our modification of the LJC(H) algorithm, whose runtime depends essentially linearly on the number of edges in a graph, may be useful for finding minimum vertex covers in very large graphs with many nodes and edges. Second, we demonstrate that the version of a minimal gene set, estimated using the LJC* algorithm on the k-tuple deletions for k = 1, 2, and 3 in yeast S. cerevisiae, is quite large, comprising the majority of all genes in the genome. To understand the reasons for this contradiction with the results of the genome engineering experiments, further investigation is required.

On a different note, defining a minimal genome is a problem in the realm of genome engineering and is not expected to recapitulate any ancestral genomes or to recover evolutionary trajectories. Nonetheless, the results reported here may have indirect implications for the study of genome evolution. Gene loss and genome reduction are major, persistent factors in evolution, especially in Bacteria and Archaea (46–50). Multiple gene losses expand the space of possible negative gene interactions, and it would be interesting to model genome evolution in order to understand how this space is navigated to avoid encountering synthetic lethality upon a loss of groups of genes.

ACKNOWLEDGMENTS

A.M. is grateful to Richard Durbin, to the members of the Durbin lab at the Department of Genetics, and to Clare Hall College, University of Cambridge, for their hospitality and the opportunity to initiate this work.

A.M. was supported by the intramural Independent Research and Development and Long-Term Professional Development programs at the National Science Foundation. B.D.S. acknowledges support from the Wellcome Trust programme in Mathematical Genomics and Medicine (WT220023).

Contributor Information

Arcady Mushegian, Email: mushegian2@gmail.com.

Geraldine Butler, University College Dublin, Dublin, Ireland.

DATA AVAILABILITY

The synthetic lethality data, the YEAST-K2 list of edges, and Python code implementing the LJC(H) and LJC* algorithms are available at https://github.com/srahimia/minimal_vc.

SUPPLEMENTAL MATERIAL

The following material is available online at https://doi.org/10.1128/msphere.00139-24.

Supplemental Material. msphere.00139-24-s0001.pdf.

Table S1 and Fig. S1.

DOI: 10.1128/msphere.00139-24.SuF1

ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

REFERENCES

  • 1. Koonin EV. 2000. How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet 1:99–116. doi: 10.1146/annurev.genom.1.1.99
  • 2. Glass JI, Merryman C, Wise KS, Hutchison CA, Smith HO. 2017. Minimal cells—real and imagined. Cold Spring Harb Perspect Biol 9:a023861. doi: 10.1101/cshperspect.a023861
  • 3. Mushegian AR, Koonin EV. 1996. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93:10268–10273. doi: 10.1073/pnas.93.19.10268
  • 4. Linard B, Ebersberger I, McGlynn SE, Glover N, Mochizuki T, Patricio M, Lecompte O, Nevers Y, Thomas PD, Gabaldón T, et al. 2021. Ten years of collaborative progress in the quest for orthologs. Mol Biol Evol 38:3033–3045. doi: 10.1093/molbev/msab098
  • 5. Koonin EV, Mushegian AR, Bork P. 1996. Non-orthologous gene displacement. Trends Genet 12:334–336. doi: 10.1016/0168-9525(96)20010-1
  • 6. Gil R, Silva FJ, Peretó J, Moya A. 2004. Determination of the core of a minimal bacterial gene set. Microbiol Mol Biol Rev 68:518–537. doi: 10.1128/MMBR.68.3.518-537.2004
  • 7. Galperin MY. 2006. The minimal genome keeps growing. Environ Microbiol 8:569–573. doi: 10.1111/j.1462-2920.2006.01021.x
  • 8. Altenhoff AM, Dessimoz C. 2009. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 5:e1000262. doi: 10.1371/journal.pcbi.1000262
  • 9. Gabaldón T, Dessimoz C, Huxley-Jones J, Vilella AJ, Sonnhammer EL, Lewis S. 2009. Joining forces in the quest for orthologs. Genome Biol 10:403. doi: 10.1186/gb-2009-10-9-403
  • 10. Juhas M, Eberl L, Glass JI. 2011. Essence of life: essential genes of minimal genomes. Trends Cell Biol 21:562–568. doi: 10.1016/j.tcb.2011.07.005
  • 11. Glover N, Dessimoz C, Ebersberger I, Forslund SK, Gabaldón T, Huerta-Cepas J, Martin M-J, Muffato M, Patricio M, Pereira C, da Silva AS, Wang Y, Sonnhammer E, Thomas PD. 2019. Advances and applications in the quest for orthologs. Mol Biol Evol 36:2157–2164. doi: 10.1093/molbev/msz150
  • 12. Mushegian AR. 2007. Foundations of comparative genomics. Academic Press, Amsterdam; Boston.
  • 13. Omelchenko MV, Galperin MY, Wolf YI, Koonin EV. 2010. Non-homologous isofunctional enzymes: a systematic analysis of alternative solutions in enzyme evolution. Biol Direct 5:31. doi: 10.1186/1745-6150-5-31
  • 14. Itaya M. 1995. An estimation of minimal genome size required for life. FEBS Lett 362:257–260. doi: 10.1016/0014-5793(95)00233-y
  • 15. Hutchison CA III, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Craig Venter J. 1999. Global transposon mutagenesis and a minimal mycoplasma genome. Science 286:2165–2169. doi: 10.1126/science.286.5447.2165
  • 16. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, et al. 1999. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285:901–906. doi: 10.1126/science.285.5429.901
  • 17. Roemer T, Jiang B, Davison J, Ketela T, Veillette K, Breton A, Tandia F, Linteau A, Sillaots S, Marta C, Martel N, Veronneau S, Lemieux S, Kauffman S, Becker J, Storms R, Boone C, Bussey H. 2003. Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery. Mol Microbiol 50:167–181. doi: 10.1046/j.1365-2958.2003.03697.x
  • 18. Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al. 2004. Global mapping of the yeast genetic interaction network. Science 303:808–813. doi: 10.1126/science.1091317
  • 19. Joyce AR, Reed JL, White A, Edwards R, Osterman A, Baba T, Mori H, Lesely SA, Palsson BØ, Agarwalla S. 2006. Experimental and computational assessment of conditionally essential genes in Escherichia coli. J Bacteriol 188:8259–8271. doi: 10.1128/JB.00740-06
  • 20. Typas A, Nichols RJ, Siegele DA, Shales M, Collins SR, Lim B, Braberg H, Yamamoto N, Takeuchi R, Wanner BL, Mori H, Weissman JS, Krogan NJ, Gross CA. 2008. High-throughput, quantitative analyses of genetic interactions in E. coli. Nat Methods 5:781–787. doi: 10.1038/nmeth.1240
  • 21. Fels SR, Zane GM, Blake SM, Wall JD. 2013. Rapid transposon liquid enrichment sequencing (TnLE-seq) for gene fitness evaluation in underdeveloped bacterial systems. Appl Environ Microbiol 79:7510–7517. doi: 10.1128/AEM.02051-13
  • 22. Peng C, Lin Y, Luo H, Gao F. 2017. A comprehensive overview of online resources to identify and predict bacterial essential genes. Front Microbiol 8:2331. doi: 10.3389/fmicb.2017.02331
  • 23. Rancati G, Moffat J, Typas A, Pavelka N. 2018. Emerging and evolving concepts in gene essentiality. Nat Rev Genet 19:34–49. doi: 10.1038/nrg.2017.74
  • 24. Phillips PC, Johnson NA. 1998. The population genetics of synthetic lethals. Genetics 150:449–458. doi: 10.1093/genetics/150.1.449
  • 25. Welch JJ. 2004. Accumulating Dobzhansky-Muller incompatibilities: reconciling theory and data. Evolution 58:1145–1156. doi: 10.1111/j.0014-3820.2004.tb01695.x
  • 26. Boone C, Bussey H, Andrews BJ. 2007. Exploring genetic interactions and networks with yeast. Nat Rev Genet 8:437–449. doi: 10.1038/nrg2085
  • 27. Phillips PC. 2008. Epistasis--the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet 9:855–867. doi: 10.1038/nrg2452
  • 28. Roth FP, Lipshitz HD, Andrews BJ. 2009. Q&A: epistasis. J Biol 8:35. doi: 10.1186/jbiol144
  • 29. Turner LM, White MA, Tautz D, Payseur BA. 2014. Genomic networks of hybrid sterility. PLoS Genet 10:e1004162. doi: 10.1371/journal.pgen.1004162
  • 30. Dixon SJ, Costanzo M, Baryshnikova A, Andrews B, Boone C. 2009. Systematic mapping of genetic interaction networks. Annu Rev Genet 43:601–625. doi: 10.1146/annurev.genet.39.073003.114751
  • 31. Baryshnikova A, Costanzo M, Kim Y, Ding H, Koh J, Toufighi K, Youn J-Y, Ou J, San Luis B-J, Bandyopadhyay S, Hibbs M, Hess D, Gingras A-C, Bader GD, Troyanskaya OG, Brown GW, Andrews B, Boone C, Myers CL. 2010. Quantitative analysis of fitness and genetic interactions in yeast on a genome scale. Nat Methods 7:1017–1024. doi: 10.1038/nmeth.1534
  • 32. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JLY, Toufighi K, Mostafavi S, et al. 2010. The genetic landscape of a cell. Science 327:425–431. doi: 10.1126/science.1180823
  • 33. Costanzo M, VanderSluis B, Koch EN, Baryshnikova A, Pons C, Tan G, Wang W, Usaj M, Hanchard J, Lee SD, et al. 2016. A global genetic interaction network maps a wiring diagram of cellular function. Science 353:aaf1420. doi: 10.1126/science.aaf1420
  • 34. Kuzmin E, VanderSluis B, Wang W, Tan G, Deshpande R, Chen Y, Usaj M, Balint A, Mattiazzi Usaj M, van Leeuwen J, et al. 2018. Systematic analysis of complex genetic interactions. Science 360:eaao1729. doi: 10.1126/science.aao1729
  • 35. Kuzmin E, Rahman M, VanderSluis B, Costanzo M, Myers CL, Andrews BJ, Boone C. 2021. τ-SGA: synthetic genetic array analysis for systematically screening and quantifying trigenic interactions in yeast. Nat Protoc 16:1219–1250. doi: 10.1038/s41596-020-00456-3
  • 36. Mani R, St.Onge RP, Hartman JL, Giaever G, Roth FP. 2008. Defining genetic interaction. Proc Natl Acad Sci U S A 105:3461–3466. doi: 10.1073/pnas.0712255105
  • 37. Crona K, Gavryushkin A, Greene D, Beerenwinkel N. 2017. Inferring genetic interactions from comparative fitness data. Elife 6:e28629. doi: 10.7554/eLife.28629
  • 38. Barido-Sottani J, Chapman SD, Kosman E, Mushegian AR. 2019. Measuring similarity between gene interaction profiles. BMC Bioinformatics 20:435. doi: 10.1186/s12859-019-3024-x
  • 39. Garey MR, Johnson DS. 1979. Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman, New York.
  • 40. Chvatal V. 1979. A greedy heuristic for the set-covering problem. Math Oper Res 4:233–235. doi: 10.1287/moor.4.3.233
  • 41. Murakami K, Tao E, Ito Y, Sugiyama M, Kaneko Y, Harashima S, Sumiya T, Nakamura A, Nishizawa M. 2007. Large scale deletions in the Saccharomyces cerevisiae genome create strains with altered regulation of carbon metabolism. Appl Microbiol Biotechnol 75:589–597. doi: 10.1007/s00253-007-0859-2
  • 42. Sasaki M, Kumagai H, Takegawa K, Tohda H. 2013. Characterization of genome-reduced fission yeast strains. Nucleic Acids Res 41:5382–5399. doi: 10.1093/nar/gkt233
  • 43. Mizoguchi H, Sawano Y, Kato J, Mori H. 2008. Superpositioning of deletions promotes growth of Escherichia coli with a reduced genome. DNA Res 15:277–284. doi: 10.1093/dnares/dsn019
  • 44. Michalik S, Reder A, Richts B, Faßhauer P, Mäder U, Pedreira T, Poehlein A, van Heel AJ, van Tilburg AY, Altenbuchner J, Klewing A, Reuß DR, Daniel R, Commichau FM, Kuipers OP, Hamoen LW, Völker U, Stülke J. 2021. The Bacillus subtilis minimal genome compendium. ACS Synth Biol 10:2767–2771. doi: 10.1021/acssynbio.1c00339
  • 45. Hutchison CA, Chuang R-Y, Noskov VN, Assad-Garcia N, Deerinck TJ, Ellisman MH, Gill J, Kannan K, Karas BJ, Ma L, Pelletier JF, Qi Z-Q, Richter RA, Strychalski EA, Sun L, Suzuki Y, Tsvetanova B, Wise KS, Smith HO, Glass JI, Merryman C, Gibson DG, Venter JC. 2016. Design and synthesis of a minimal bacterial genome. Science 351:aad6253. doi: 10.1126/science.aad6253
  • 46. Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome trees and the tree of life. Trends Genet 18:472–479. doi: 10.1016/S0168-9525(02)02744-0
  • 47. Koonin EV. 2003. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1:127–136. doi: 10.1038/nrmicro751
  • 48. Mirkin BG, Fenner TI, Galperin MY, Koonin EV. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol 3:2. doi: 10.1186/1471-2148-3-2
  • 49. Demuth JP, Hahn MW. 2009. The life and death of gene families. Bioessays 31:29–39. doi: 10.1002/bies.080085
  • 50. Albalat R, Cañestro C. 2016. Evolution by gene loss. Nat Rev Genet 17:379–391. doi: 10.1038/nrg.2016.39


Articles from mSphere are provided here courtesy of American Society for Microbiology (ASM)
