Proceedings of the National Academy of Sciences of the United States of America. 2008 Jun 11;105(24):8238–8243. doi: 10.1073/pnas.0710274105

Casting polymer nets to optimize noisy molecular codes

Tsvi Tlusty 1,*
PMCID: PMC2448821  PMID: 18550822

Abstract

Life relies on the efficient performance of molecular codes, which relate symbols and meanings via error-prone molecular recognition. We describe how optimizing a code to withstand the impact of molecular recognition noise may be understood from the statistics of a two-dimensional network made of polymers. The noisy code is defined by partitioning the space of symbols into regions according to their meanings. The “polymers” are the boundaries between these regions, and their statistics define the cost and the quality of the noisy code. When the parameters that control the cost–quality balance are varied, the polymer network undergoes a transition, where the number of encoded meanings rises discontinuously. Effects of population dynamics on the evolution of molecular codes are discussed.

Keywords: biochemical networks, information theory, polymer networks, rate-distortion theory


In the living cell, information is carried by molecules. The outside environment and the biochemical circuitry of the cell churn out fluxes of molecular information that are read, processed, and then stored in memory by other molecules. The cell's information-processing networks often need to translate a symbol written in one class of molecules into another symbol written in a different molecular language. This requires a code-table that translates between the two molecular languages. Perhaps the best-known example is the genetic code-table that translates 64 DNA base triplets into 20 amino acids (1, 2). One may think of such a code-table as a mapping—a probabilistic one because of the inherent noise—between the space of molecular symbols, e.g., the base triplets, and the space of molecular meanings, e.g., amino acids. The notion of mapping between two molecular spaces occurs also in biological codes of a much larger scale; for example, the transcriptional regulatory network that controls gene expression by DNA-binding proteins. This network may be seen as a mapping from the space of regulatory proteins to the space of their respective DNA binding sites. Evolution poses the organism with a semantic challenge: its code-tables must assign meanings to symbols in a manner that minimizes the impact of the molecular recognition errors while keeping down the cost of resources that the code-table necessitates. The present work describes a treatment of this biological optimization problem in terms of the statistical mechanics of polymer networks.

In the biophysical reality of the cell, actual polymer networks are essential for structural stability and motility (3, 4). However, in the present context of coding, polymer networks are just mathematical entities that prove useful for describing the code-table and studying its optimization. Molecular recognition is inherently noisy because it involves energy scales that are not much larger than the typical thermal energy $k_BT$. To reflect these recognition errors, the space of symbols is depicted as a graph in which symbols are vertices and edges connect vertices that are likely to be confused by misreading (Fig. 1). A code-table is then constructed by assigning a meaning to each vertex. This can be pictured as coloring the vertices according to their meaning (2), which partitions the graph into “islands of meanings.” The boundaries between these islands form a network, which can be likened to a self-assembling network of polymers or self-avoiding random walks (Fig. 1). Polymer networks are natural in this context because they are related to the notion of space partitioning that is central to coding theory (5). But the resemblance to a polymer network is not merely structural. Optimizing the fitness of the mapping is shown to be equivalent to minimizing the free energy of the polymer network. Such an optimal mapping must balance three conflicting needs: maximal error tolerance, maximal diversity, and minimal cost. During evolution, the code-table adapts by altering the network in response to changes in the equipoise of these three evolutionary forces.

Fig. 1.


The code-table as an information channel and the relation to polymer networks. The code-table is an information channel or mapping that relates a space of meanings (Left), which are depicted as colors, to a symbol space (Center). The symbol space is a graph, hexagonal in this example, in which vertices are symbols and edges connect symbols that are likely to be confused by reading. The code-table induces a coloring pattern on the symbol space, which divides it into meaning islands (dotted lines). In the triangular dual graph (Right), the boundaries between meaning islands form a network of “polymers” (thick solid lines).

In the present work, we first discuss how the partition of the symbol space by the polymer networks determines the fitness of the code-table, where mathematical definitions of fitness and its determinants, error-load, diversity, and cost are given. Next, we discuss one purpose of the present work, which is to show that the fitness function of the code-table corresponds to the free energy of the polymer network. Thus, it suggests that the problem of optimizing the code-table is equivalent to calculating the equilibrium statistics of the polymer network. A second purpose is then to use this equivalence to examine the code optimization problem in parameter regimes that are hard to access otherwise. In particular, it is used to identify a first-order transition, in which the number of meaning islands changes abruptly in response to varying the error-tolerance–diversity–cost balance. Finally, to put the model within a context of population dynamics, we discuss possible aspects of metastability, mutations, and genetic drift that may affect the evolution of the code.

Model and Results

Fitness of Molecular Codes.

Information in the cell is recorded in molecules and then retrieved and translated into other molecules by a code-table. To discuss how the organism optimizes such a code-table, we represent the table as an information channel or a mapping that relates a meaning space, in which $n_m$ meanings reside, to a symbol space, which contains $n_s$ symbols (Fig. 1). The code-table maps (or encodes) meanings to symbols and thus partitions the space of symbols into meaning islands whose boundaries form a polymer network. In this section, we first calculate the fitness of a code-table as a function of the mapping as specified by the polymer network. The code fitness is composed of contributions due to the diversity of the code-table, its average error load, and the cost of constructing the molecular coding apparatus. There are two sources of noise in our simple description of the coding system: noise while reading and noise in the mapping itself. Recognition errors while reading a symbol are represented by the symbol graph whose edges connect symbols that misreading may confuse. The impact of this noise is included in the error-load component of the fitness. Noise in the mapping, which is implemented in error-prone molecular recognizers, is represented as a statistical average over an ensemble of polymer networks. This second recognition problem requires additional resources and is included in the cost (which together with the error load is defined below).

A code-table does not necessarily map all of the $n_m$ available meanings to symbols but may map only $n_f \le n_m$ of them. The larger the number of encoded meanings $n_f$, the more diverse the code. Diversity contributes to the fitness of the code because it increases the chance that when the organism needs to read, write, or process a certain meaning it can accurately encode it as one of the available symbols. However, diversity of meanings also increases the probability of misreading errors: To reconstruct the meaning encoded by a certain symbol, the organism has to read it. Because the molecular reading apparatus is not perfect, it may sometimes confuse this symbol with one of its neighbors in the graph (Fig. 1). If many meanings are encoded then, on average, the meaning islands defined by the network are smaller. In this finer network, the chance of confusing symbols of different meanings is larger, which costs the organism a higher error load.

To find the error load, one needs to specify a partition into meaning islands and examine the average chance of crossing the boundaries between the islands by misreading. We specify a partition (see Fig. 1) by assigning to each edge i–j a binary variable $E_{ij}$ that indicates whether the edge is on the boundary between two meaning islands ($E_{ij} = 1$) or inside an island ($E_{ij} = 0$). Misreading along the edge, which confuses i with j, occurs with a probability $r_{ij}$, while $r_{ii}$ is the probability of reading i correctly. There are two possible outcomes of misreading: If both symbols are in the same island and are therefore synonymous, then misreading bears no load because the translated meaning does not change. If the symbols reside in two different islands and are therefore nonsynonymous, then the fitness of the organism decreases by one fitness unit. The contribution of an i–j misreading to the error load can therefore be quantified as $r_{ij}E_{ij}$, and the total error load is a sum over all edges, $\sum_{ij} r_{ij}E_{ij}$.
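The error load defined above can be illustrated with a minimal sketch. The ring-shaped symbol graph, the coloring, and the uniform misreading probability below are hypothetical choices made only for illustration; the function names are not from the paper.

```python
def boundary_edges(edges, meaning):
    """E_ij = 1 if edge i-j separates two meaning islands, else 0."""
    return {(i, j): int(meaning[i] != meaning[j]) for i, j in edges}

def error_load(edges, meaning, r):
    """Total error load: sum over edges of r_ij * E_ij."""
    E = boundary_edges(edges, meaning)
    return sum(r[edge] * E[edge] for edge in edges)

# Hypothetical toy symbol graph: six symbols on a ring.
n_symbols = 6
edges = [(i, (i + 1) % n_symbols) for i in range(n_symbols)]
meaning = [0, 0, 1, 1, 2, 2]        # three islands of two symbols each
r = {edge: 0.01 for edge in edges}  # assumed uniform misreading probability

E = boundary_edges(edges, meaning)
load = error_load(edges, meaning, r)
```

Only the three edges that cross an island boundary contribute, so merging two islands would lower the load at the expense of diversity.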

The need for diversity is an evolutionary force that counteracts the need to minimize error load. Our model assumes for simplicity that the contribution of diversity to the fitness is linear in $n_f$, the number of encoded meanings. The quality $H_E$ of a given network pattern $E$ is then a linear combination of the error load and the diversity,

$$H_E = \sum_{ij} r_{ij}E_{ij} - w_D n_f, \tag{1}$$

where the parameter $w_D$ measures the significance of diversity relative to error load. We use a sign convention in which an optimal code of high quality corresponds to low values of $H_E$. The quality depends on the coloring pattern of the code-table, which determines its error load and diversity. As illustrated in Fig. 1, each coloring pattern is determined by the network of boundaries between the islands, which is equivalent to a network of polymers. The quality is governed by the interplay between error load and diversity: If the reader were ideal, $r_{ij} = \delta_{ij}$, it would be advantageous to decode as many meanings as there are available symbols, $n_f = n_s$. However, because the molecular reader is not perfect, it is preferable to decode fewer meanings to minimize the effect of misreading errors. The quality $H_E$ corresponds to a “microstate” specified by a deterministic network configuration $E$. The stochastic mapping of the molecular code is a “macrostate,” which is represented by an ensemble of such configurations and is calculated below. The code quality $H_C$ is defined as the ensemble average of quality over all microstates, $H_C = \langle H_E \rangle$.

Besides the quality of the code, which combines its error load and diversity, the code fitness must also account for the cost of the molecular machinery that performs the mapping. Molecular codes are physically implemented by recognition interactions between the meanings, the symbols, and sometimes other intermediary molecules, such as the tRNA in the genetic code (1). High specificity of recognition improves the quality of the code because it enables more accurate mapping. However, highly specific binding requires a higher binding energy, which in general necessitates larger binding sites. It is plausible to assume that the cost of the code is proportional to the average size of the binding sites and therefore to the average binding energy (6). This is because the cost of synthesizing the molecules and maintaining their genes is proportional to their size. To estimate the cost, one notes that the binding probability scales like the Boltzmann exponent of the binding energy (in units of $k_BT$), $P_b \sim \exp(E_b)$. It follows that the cost, which is equal to the average binding energy $\langle E_b \rangle$, can be approximated by the average $\langle \ln P_b \rangle$, which is minus the entropy of the mapping of meanings to symbols, $-S_C$ (see Methods). This entropy averages over the ensemble of all possible mappings, which is determined by all possible networks and the number of possible ways to color every such network.

Finally, the overall fitness of the code $F_C$ is estimated by a weighted sum of the quality and the cost, $F_C = H_C - w_C S_C$, where the parameter $w_C$ measures the significance of the cost with respect to quality. $F_C$ is like a free energy of all possible colorings of the code-table and may be derived from the partition function $Z_C$,

$$F_C = -w_C \ln Z_C = -w_C \ln \sum_{\text{colorings}} \exp(-H_E/w_C). \tag{2}$$

Within this analogy, the quality $H_E$ plays the role of energy, and $w_C$ is an effective temperature. The ensemble average in $Z_C$ is due to the probabilistic nature of the molecular mapping. At high $w_C$, codes are fuzzy and smeared over many network configurations, whereas at low $w_C$, they are sharper because the mapping is almost deterministic. In principle, one can derive the code-table by performing the summation in Eq. 2 (6), but in practice this is a burdensome task that can be performed only numerically, and even then only for codes of limited size. Tractable analytic results exist mostly in the limit of high $w_C$. In this regime, the code-table undergoes a second-order phase transition from a noncoding state of no correlation between meanings and symbols to a coding state, in which such correlation has just emerged (2, 6–8).
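For a code small enough to enumerate, the free-energy analogy can be checked directly. The sketch below is only an illustration under stated assumptions: the symbol graph is a hypothetical four-symbol ring, misreading is uniform, and the number of encoded meanings is counted as the number of distinct colors in use. It sums over all colorings, forms the Boltzmann weights $\exp(-H_E/w_C)$, and compares $-w_C \ln Z_C$ with $H_C - w_C S_C$.

```python
import math
from itertools import product

def quality(coloring, edges, r, w_D):
    """H_E = error load - w_D * n_f for one deterministic code-table."""
    load = sum(r for i, j in edges if coloring[i] != coloring[j])
    n_f = len(set(coloring))  # assumption: n_f counted as distinct colors used
    return load - w_D * n_f

def code_thermodynamics(n_s, n_m, r, w_D, w_C):
    """Brute-force Z_C over all n_m**n_s colorings of a toy ring graph."""
    edges = [(i, (i + 1) % n_s) for i in range(n_s)]
    H = [quality(c, edges, r, w_D) for c in product(range(n_m), repeat=n_s)]
    Z = sum(math.exp(-h / w_C) for h in H)
    p = [math.exp(-h / w_C) / Z for h in H]      # ensemble of code-tables
    H_C = sum(pi * h for pi, h in zip(p, H))     # average quality
    S_C = -sum(pi * math.log(pi) for pi in p)    # entropy of the ensemble
    F_C = -w_C * math.log(Z)                     # free energy from Z_C
    return F_C, H_C, S_C

# Illustrative parameter values, not taken from the paper.
w_C = 0.2
F_C, H_C, S_C = code_thermodynamics(n_s=4, n_m=3, r=0.1, w_D=0.05, w_C=w_C)
```

The enumeration grows as $n_m^{n_s}$, which is why this direct route is limited to very small codes.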

The cost–quality balance of the code-table is analogous to the balance in an engineered noisy information channel between the average distortion in the channel $H$, which measures its quality, and the channel's rate $I$, which measures the cost of the channel by estimating how many bits are required to encode one meaning. Rate-distortion theory (9) focuses on the fundamental problem of optimizing a noisy information channel, which can be formalized as the following question: What is the minimal rate $I$ required to assure that the distortion in the channel will be less than a certain desired value $H$? This optimal rate-distortion curve is calculated by minimizing a functional $F = H + w_C I$, where $H$, $I$, and $F$ are analogues of $H_C$, $-S_C$, and $F_C$, respectively. The “temperature” $w_C = -\partial H/\partial I$, the slope of the optimal curve, measures the increase in quality due to an additional bit of information. In the biological context, $w_C$ is expected to decrease with the complexity of the organism and its environment: A complex organism transmits more information. It is therefore in the interest of this organism to pay a larger cost to improve the quality of its codes, and $w_C = \partial H_C/\partial S_C$ is lower. Similarly, a richer environment is also “colder.” At low $w_C$, the quality $H_C$ dominates the free energy $F_C$ and the code-table tends to one of the many minima of $H_E$. Derivation of the optimal code in this regime is difficult, even numerically, because of the rugged landscape of $H_E$. As we discuss below, the polymer network formalism offers insight into this regime and, specifically, suggests a first-order coding transition.

Statistics of Polymer Networks on the Dual Symbol Graph.

To formalize the analogy of molecular codes to polymer networks, we need to examine the dual of the symbol graph on which the network resides. To find the dual graph, one embeds the symbol graph in a surface (10). In the example of Fig. 1, the symbol graph is a hexagonal lattice that is embedded in a torus and the dual is a triangular lattice (see Methods). It is evident that the vertex-coloring pattern of the symbol graph corresponds to a connected “polymer” network whose monomers are a subset of the edges of the dual graph (Fig. 1).

By counting the number of edges and vertices in the polymer network one can derive the number of meaning islands $n_f$ in the quality $H_E$ (Eq. 1). For this purpose, we introduce another binary variable, $V_i$, that indicates whether a vertex i of the dual graph is part of the polymer network ($V_i = 1$) or not ($V_i = 0$). Then, the numbers of occupied edges $n_e$, occupied vertices $n_v$, and islands $n_f$ are related through the definition of Euler's characteristic $\chi = n_v - n_e + n_f$ (10),

$$n_f = \chi - n_v + n_e = \chi - \sum_i V_i + \sum_{ij} E_{ij}. \tag{3}$$

$\chi$ is determined by the topology of the surface in which the symbol graph is embedded; for example, $\chi = 0$ for the torus in Fig. 1. By substitution of Eq. 3 into Eq. 1, we find that the code quality is $H_E = \sum_{ij}(r_{ij} - w_D)E_{ij} + w_D\sum_i V_i - w_D\chi$, and the code's partition function is therefore $Z_C = \sum_{E,V} N_{EV} \exp(-H_E/w_C)$, where the summation is over all valid network configurations. The factor $N_{EV}$ is the number of possible ways to color a given pattern specified by the fields $E$ and $V$. As is common in biological codes, it is assumed henceforth that there are many more meanings than available symbols, $n_m \gg n_s \ge n_f$, so the combinatorial factor can be estimated as $N_{EV} \approx \exp(n_f \ln n_m)$. By substituting into $Z_C$ the approximated $N_{EV}$ with the island number $n_f$ taken from Eq. 3, the partition function becomes

$$Z_C \propto \sum_{E,V} \exp\left[-\frac{1}{w_C}\left(\alpha\sum_i V_i + \sum_{ij}\beta_{ij}E_{ij}\right)\right], \tag{4}$$

with the coefficients $\alpha = w_D + w_C \ln n_m$ and $\beta_{ij} = (r_{ij} - w_D) - w_C \ln n_m$.
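The substitution that leads to these coefficients can be written out step by step. The following is a sketch reconstructed from the definitions in the text (Euler's relation, the quality $H_E$, and the estimate $N_{EV} \approx \exp(n_f \ln n_m)$), not a quotation of the original derivation:

```latex
\begin{aligned}
n_f &= \chi - \sum_i V_i + \sum_{ij} E_{ij}, \\
H_E &= \sum_{ij} r_{ij}E_{ij} - w_D n_f
     = \sum_{ij}(r_{ij} - w_D)E_{ij} + w_D\sum_i V_i - w_D\chi, \\
N_{EV}\,e^{-H_E/w_C} &= \exp\!\left[n_f \ln n_m - \frac{H_E}{w_C}\right] \\
 &= \exp\!\left[-\frac{1}{w_C}\left(\sum_{ij}\bigl(r_{ij} - w_D - w_C\ln n_m\bigr)E_{ij}
    + \bigl(w_D + w_C\ln n_m\bigr)\sum_i V_i\right) + \frac{\alpha\chi}{w_C}\right].
\end{aligned}
```

The last term, $\exp(\alpha\chi/w_C)$, is the same for every network configuration, so it only rescales the partition function, and it equals one for the torus, where $\chi = 0$.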

The code partition function $Z_C$ (Eq. 4) sums over all possible networks. The building blocks of these networks are self-avoiding polymers that fuse to each other at junctions. Within the analogy to physical networks, Eq. 4 may appear as a summation over two chemical “species”: one that resides on the edges with “excitation energies” (or chemical potentials) $\beta_{ij}$, and one on the vertices with the excitation energy $\alpha$. However, these two species are not really independent and cannot be summed separately. A vertex is occupied if and only if its coordination number is at least two because “dangling ends” or isolated vertices are forbidden. Similarly, an edge is occupied if and only if it connects two occupied vertices. The relevant chemical species are the monomers, i.e., pairs of a vertex and a neighboring edge that carry energies of $\alpha + \beta_{ij}$, and the possible k-fold junctions ($k > 2$). The formation of a k-junction replaces k ends, each contributing an energy of $\alpha/2$, by one vertex of energy $\alpha$, so that the overall energy change is $(1 - k/2)\alpha$. Performing the summation over all possible networks proves to be tricky because the vertex and edge occupations are not independent. Below, we employ the n = 0 formalism to resolve this configuration counting problem.

Correspondence Between Code Optimization and Spin Networks.

The n = 0 formalism was devised by de Gennes to examine polymer solutions (11, 12). Recently, it was also applied to microemulsions, micellar solutions, and dipolar fluids (13–15). At the basis of this approach is a mathematical equivalence between a system of self-excluding polymers and a system of interacting n-component magnets, in the limit of a vanishing number of components, n = 0. The n = 0 formalism is reviewed elsewhere (12, 13). Here, we only discuss concisely the basic idea and use this approach to show that the statistics of the code-table (Eqs. 2 and 4) can be mapped to that of the zero-component magnets.

To demonstrate the equivalence between the code-table problem and the spin system, let us consider the dual of the symbol graph on which the polymer network resides (Fig. 1) and assign to each of the edges i–j an n-component magnetic spin $\mathbf{S}_{ij}$. The interaction is represented by a spin Hamiltonian $H_S$ and the spin partition function is $Z_S = \langle \exp(-H_S) \rangle$, where $\langle \ldots \rangle$ denotes the average over all possible spin orientations. A peculiar feature of this system in the limit of zero components (“the n = 0 property”) is that all averages over products of spins vanish except for the quadratic averages $\langle S_{ij}^2 \rangle = 1$, where $S_{ij}$ is any of the components of $\mathbf{S}_{ij}$. This property enables mapping of the spin lattice to the network ensemble by tailoring a spin Hamiltonian $H_S$ and a consequent spin partition function $Z_S$, in which an $\langle S_{ij}^2 \rangle$ term appears if and only if the corresponding edge i–j is occupied in the network partition function $Z_C$. It is shown below that this correspondence is accomplished by the Hamiltonian $H_S$,

$$H_S = -\sum_i h_i, \qquad h_i = a_i\left[\prod_{j(i)}(1 + b_{ij}S_{ij}) - \sum_{j(i)} b_{ij}S_{ij} - 1\right], \tag{5}$$

where $j(i)$ are all of the neighbors of i, and the coefficients $a_i$ and $b_{ij}$ are related to the $\alpha$ and $\beta_{ij}$ parameters (Eq. 4). The functions $h_i$ are the contributions of each vertex i to $H_S$ and consist of all possible products of two or more edges around i. Each of these products corresponds to a possible edge occupation state in a network (Fig. 2). The one- and zero-edge configurations are forbidden in the network and are therefore subtracted from $h_i$ (the second and third terms).

Fig. 2.


Correspondence of the polymer networks to n = 0 spin systems. The solid lines denote boundaries between the meaning islands induced by the code on the dual graph (Fig. 1 Right). In the spin model, a spin $S_{ij}$ is assigned to each edge. Each vertex i contributes to the spin Hamiltonian $H_S$ a factor $h_i$, which accounts for all possible edge occupancies around this vertex. By the construction of $h_i$ (Eq. 5), if a vertex is occupied then at least two of the adjacent edges are occupied. In the present example, a four-junction at vertex 1 (red), which corresponds to a factor $a_1 b_{13}S_{13}\, b_{14}S_{14}\, b_{16}S_{16}\, b_{17}S_{17}$, connects to three linear elements (magenta), e.g., $a_7 b_{71}S_{71}\, b_{79}S_{79}$, and one three-junction (green), $a_3 b_{31}S_{31}\, b_{32}S_{32}\, b_{38}S_{38}$. The corresponding contribution to the spin partition function $Z_S$ is an average over all of the spin orientations. This contribution does not vanish because each spin appears exactly twice in the product: $S_{ij}$ appears exactly once in each of the edge configurations of i and j. The weight of this contribution is the product of the $b_{ij}$ and $a_i$ factors for each edge and vertex in the product.

How this form of the Hamiltonian $H_S$ ensures the equivalence becomes clear by expanding $Z_S$ in a power series, $Z_S = \langle \exp(-H_S) \rangle = \langle \prod_i \exp(h_i) \rangle = \langle \prod_i (1 + h_i + h_i^2/2 + \ldots) \rangle$. Because of the n = 0 property, the infinite series can be exactly truncated at the second term and we obtain $Z_S = \langle \prod_i (1 + h_i) \rangle$, which is a sum over averages of spin products. In this sum, the only nonvanishing terms are those in which each spin $S_{ij}$ appears exactly twice. In those configurations, $S_{ij}$ must appear in both contributions of $h_i$ and $h_j$. As illustrated in Fig. 2, such a spin configuration corresponds to a network configuration, and the weight of this term in the partition function is a product of the $b_{ij}$ and $a_i$ factors of the occupied edges and vertices. From all of this, we find that the spin partition function is

$$Z_S = \sum_{E,V} \prod_i a_i^{V_i} \prod_{ij} b_{ij}^{2E_{ij}}, \tag{6}$$

with the same occupation variables $E_{ij}$ and $V_i$ that are used to count network configurations in the code partition function $Z_C$ (Eq. 4). Finally, to obtain the one-to-one correspondence one needs to identify the “fugacities” in Eqs. 4 and 6, $b_{ij}^2 = \exp(-\beta_{ij}/w_C)$ and $a_i = \exp(-\alpha/w_C)$, which results in identical partition functions, $Z_C = Z_S$ (up to an irrelevant factor), and free energies, $F_C = F_S$. In the following, this correspondence is used to gain insight into the noisy coding system from a mean-field solution of the spin system.
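The counting argument behind this correspondence can be illustrated on a tiny graph. The sketch below is an assumption-laden toy, not the paper's calculation: it takes the complete graph on four vertices as a hypothetical dual graph with uniform fugacities `a` and `b`, expands $\prod_i (1 + h_i)$ by letting every vertex pick either the factor 1 or a set of at least two incident edges, keeps only the terms in which each spin appears zero or two times (the n = 0 rule), and compares the result with a direct sum over networks that have no dangling ends.

```python
from itertools import combinations, product

def spin_expansion(vertices, edges, a, b):
    """Surviving terms of <prod_i (1 + h_i)> under the n = 0 rule."""
    incident = {v: [e for e in edges if v in e] for v in vertices}
    options = []
    for v in vertices:
        opts = [()]  # the factor 1: vertex v is unoccupied
        for k in range(2, len(incident[v]) + 1):
            opts.extend(combinations(incident[v], k))  # terms of h_v
        options.append(opts)
    total, count = 0.0, 0
    for choice in product(*options):
        uses = {e: 0 for e in edges}
        for picked in choice:
            for e in picked:
                uses[e] += 1
        # A term survives only if every spin appears exactly 0 or 2 times.
        if all(u in (0, 2) for u in uses.values()):
            n_v = sum(1 for picked in choice if picked)
            total += a ** n_v * b ** sum(uses.values())
            count += 1
    return total, count

def network_sum(vertices, edges, a, b):
    """Direct sum over edge subsets without dangling ends (degree != 1)."""
    total, count = 0.0, 0
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            deg = {v: 0 for v in vertices}
            for i, j in subset:
                deg[i] += 1
                deg[j] += 1
            if all(d != 1 for d in deg.values()):
                n_v = sum(1 for d in deg.values() if d >= 2)
                total += a ** n_v * b ** (2 * len(subset))
                count += 1
    return total, count

vertices = [0, 1, 2, 3]
edges = list(combinations(vertices, 2))  # hypothetical dual graph: K4
a, b = 0.5, 0.8                          # illustrative fugacities
Z_spin, n_spin = spin_expansion(vertices, edges, a, b)
Z_net, n_net = network_sum(vertices, edges, a, b)
```

Both routes visit the same set of configurations, each weighted by one factor of `a` per occupied vertex and `b**2` per occupied edge, which is the content of Eq. 6 for this toy graph.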

Mean-Field Solution of the n = 0 Spin System.

To solve for the optimum of a noisy molecular code, we employ a standard variational mean-field technique (13–15). We approximate the spin probability distribution by a product of independent single-spin distributions, $\rho = \prod_{ij}\rho_{ij}(S_{ij})$, which is used to construct a variational least upper bound on the free energy of the system $F_S$ (see Methods). By this mean-field procedure, it is straightforward to find that the average spin, $s_{ij} = \langle S_{ij}\,\rho_{ij}(S_{ij}) \rangle$, is given by the relation

$$s_{ij} = \frac{g_{ij}}{1 + g_{ij}^2/2}, \tag{7}$$

where $g_{ij}$ are effective fields that involve the spins on neighboring edges, $g_{ij} = a_i b_{ij}\left[\prod_{k(i)\neq j}(1 + b_{ik}s_{ik}) + \prod_{l(j)\neq i}(1 + b_{jl}s_{jl}) - 2\right]$. In a similar fashion, we obtain the mean-field free energy, $F_S = -\sum_i h_i + \sum_{ij}\left[g_{ij}s_{ij} - \ln(1 + g_{ij}^2/2)\right]$. The first term in $F_S$ is the average Hamiltonian, which is the quality of the average code, while the last term is entropic and accounts for the cost. The self-consistency relations (Eq. 7) are polynomial equations in the average spins, which link every spin $s_{ij}$ only to the spins on the neighboring edges. Although in the general case a solution is obtained only numerically, it is much simpler to solve than the typical rate-distortion expression (e.g., ref. 6). More importantly, it provides insight into the “low-temperature” (low $w_C$) regime where the landscape of the code's free energy is rugged and therefore hard to calculate.

To make use of the equivalence between the spin system and the coding network, we need to express the average network occupancies, $e_{ij} = \langle E_{ij} \rangle$ and $v_i = \langle V_i \rangle$, as functions of the average spin $s_{ij}$. The average edge occupancy is given by $e_{ij} = (1/2)\,\partial \ln Z_S/\partial \ln b_{ij} = g_{ij}s_{ij}/2$, a consequence of Eqs. 5–7. Likewise, the average vertex occupancy is $v_i = \partial \ln Z_S/\partial \ln a_i = h_i$ (see Methods). Thus, one can calculate the average network configuration (that is, the average code) for any value of the fugacities $a_i$ and $b_{ij}$, or for the equivalent control parameters of the coding system, $w_D$, $w_C$, $r_{ij}$, and $n_m$. This is demonstrated below, where a first-order “coding transition” is deduced from the spin formalism.

Mean-field models similar to the one used here are standard in n = 0 treatments of self-assembling systems, such as polymer and micellar solutions (14, 15) and networks (13). The basic idea of the mean-field approach is to replace the spin–spin interactions by an interaction of a single spin with an effective field (the $g_{ij}$ polynomials). This procedure vastly simplifies the problem and enables a relatively simple solution. However, this simplicity comes at the cost of disregarding the long-range spatial correlations between the spins and the corresponding correlations between the edges and the vertices. This implies, for example, that one can estimate the mean connectivity in the network but cannot tell how many loops it contains. In general, a mean-field treatment approximates the behavior of thermodynamic functions only qualitatively. However, the accuracy of this approximation improves when each spin interacts with many neighbors. Therefore, when the symbol graph is highly connected (such as the graph of the genetic code, where each codon has nine neighbors), the mean-field approximation is expected to be relatively accurate and provides a basis for more elaborate models.

A First-Order Coding Transition.

The equivalence of the spin system and the code-table enables us to follow the evolution of the code-table in response to variation of the control parameters that govern its optimization: the cost weight $w_C$, the misreading matrix $r_{ij}$, the diversity weight $w_D$, and the number of meanings $n_m$. These parameters are not independent but related through the spin fugacities, $a$ and $b_{ij}$, or the equivalent edge and vertex energies, $\alpha$ and $\beta_{ij}$. It proves convenient to represent these relations in terms of the normalized diversity, $D = w_D/w_C + \ln n_m = -\ln a = \alpha/w_C$, and the normalized misreading probabilities, $R_{ij} = r_{ij}/w_C = -\ln(a b_{ij}^2) = (\alpha + \beta_{ij})/w_C$.

To examine the response of the code-table to variation of the four control parameters, we consider, for the sake of simplicity, regular symbol graphs, in which all of the vertices and edges are equivalent. Regular symbol graphs are useful in a biological context. For example, a regular graph may approximate the symbol graph of the genetic code, where each of the 64 codons has 9 neighbors (2). Regular graphs may also describe large symbol spaces, for example the space of DNA binding sites of the transcription system, whose structure is not exactly known but whose average coordination number q is well determined. Regularity of the symbol graph implies uniform misreading $r_{ij} = r$ and, as a result, homogeneous average spin $s_{ij} = s$ and occupancies, $e_{ij} = e$ and $v_i = v$. Because of symmetry, we need to solve a single self-consistency relation $s = g(1 + g^2/2)^{-1}$ (Eq. 7) with $g = 2ab\left((1 + bs)^{q-1} - 1\right)$. The free energy per vertex of the dual graph is given by $f = -h + (q/2)gs - (q/2)\ln(1 + g^2/2)$, in agreement with the general expression for $F_S$ when each vertex is assigned $q/2$ edges, where $h = a\left((1 + bs)^q - qbs - 1\right)$ and q is the coordination number.
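The single self-consistency relation for the regular graph is easy to explore numerically. The sketch below is a minimal illustration, not the procedure used in the paper: it assumes the free energy per vertex in the form $f = -h + (q/2)\left[gs - \ln(1 + g^2/2)\right]$, whose stationary points satisfy $s = g/(1 + g^2/2)$, recovers the fugacities from $D = -\ln a$ and $R = -\ln(ab^2)$, and locates nonzero solutions by a grid scan followed by bisection. The values $D = 3.0$, $R = 2.6$, and $q = 6$ (triangular dual graph) are taken from the Fig. 3 caption; the function names and numerical tolerances are hypothetical.

```python
import math

def mean_field(s, D, R, q):
    """Return (g, h, f) for a regular dual graph of coordination number q."""
    a = math.exp(-D)               # D = -ln a
    b = math.exp((D - R) / 2.0)    # R = -ln(a b^2)
    g = 2 * a * b * ((1 + b * s) ** (q - 1) - 1)
    h = a * ((1 + b * s) ** q - q * b * s - 1)
    f = -h + (q / 2) * (g * s - math.log(1 + g * g / 2))
    return g, h, f

def residual(s, D, R, q):
    """Zero when s satisfies the self-consistency relation (Eq. 7)."""
    g, _, _ = mean_field(s, D, R, q)
    return s - g / (1 + g * g / 2)

def nonzero_solutions(D, R, q, n_grid=2000, tol=1e-12):
    """Scan (0, 1/sqrt(2)] for sign changes of the residual, then bisect."""
    s_max = 1 / math.sqrt(2)       # g / (1 + g^2/2) never exceeds this value
    grid = [s_max * (k + 1) / n_grid for k in range(n_grid)]
    roots = []
    for lo, hi in zip(grid, grid[1:]):
        if residual(lo, D, R, q) * residual(hi, D, R, q) < 0:
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if residual(lo, D, R, q) * residual(mid, D, R, q) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots

D, R, q = 3.0, 2.6, 6
s_c = nonzero_solutions(D, R, q)[-1]    # largest root: candidate network state
g_c, h_c, f_c = mean_field(s_c, D, R, q)
e_c = g_c * s_c / 2                     # average edge occupancy e = g s / 2
```

At a solution, the pair $(s, e)$ lies on the ellipse $2s^2 + (2e - 1)^2 = 1$ quoted in the Fig. 3 caption, which follows algebraically from Eq. 7 and $e = gs/2$ and gives a convenient check on the root.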

The resulting phase diagram of the regular symbol graph exhibits a line of first-order transitions, where the number of encoded meanings jumps discontinuously from $n_f = 1$, with a mapping that encodes a sole meaning, to a number $n_f > 1$ that scales extensively with the size of the symbol graph $n_s$ (Fig. 3). The state $n_f = 1$ is termed noncoding, because the code-table in this state conveys no information because only one symbol is used. When a coding state, $n_f > 1$, emerges, the coding system is capable of conveying information at a rate of $\log_2 n_f$ bits/symbol. Tracing the behavior of the free energy f as the scaled misreading R is varied (Fig. 3A), we find that at high R the system is in the noncoding, no-network state, as manifested in the profile of f by a global minimum at $s = e = v = 0$. This is because the system prefers to reduce the impact of misreading errors, which are too costly at a high R, at the expense of diversity. As R decreases, the system reaches the first-order transition, which occurs after a second, coding-state minimum, $s_C \neq 0$, emerges and exhibits $f(s_C) = f(0) = 0$.

Fig. 3.


The free energy and phase diagram of the code-table. (A) Free energy f (Upper) and the edge occupancy e (Lower) of the regular hexagonal symbol graph (Fig. 1) at scaled diversity D = 3.0 and scaled misreading R = 3.2, 3.0, 2.6, 2.3 (legend). All of the curves exhibit an extremum at the no-network state s = 0. At the coloring transition (green curve), the second minimum that corresponds to the network state $s_C$ is at $f(s_C) = f(0) = 0$. At lower values of R, the network state becomes the global minimum. At $R = \ln(2q - 2)$ the no-network state is destabilized (black). The dashed curve traces the loci of the network state as R varies. These loci are found at the intersection of the edge density $e = gs/2$ and the ellipse $2s^2 + (2e - 1)^2 = 1$ (Lower). (B) Phase diagram of the coding (network ↔ no network) transition (solid line) in the D–R plane. The dashed line bounds the region of metastability, beyond which the no-network state is destabilized. (C) Vertex occupancy v, edge occupancy e, and the ratio of meanings per symbol $n_f/n_s$ along the transition line (see text).

In the D–R plane (Fig. 3B), the phase transition line approaches a straight line, $R \approx (1 - 2/q)D$, which corresponds to a q-fold junction whose “energy” equals the thermal energy $w_C$, $\alpha + (q/2)\beta \approx w_C$. In other words, the system undergoes a phase transition when q-junctions become thermally excitable. Because q-junctions are the majority species, the emergent network is dense and highly connected. The transition line indicates various pathways that the system can take toward the formation of a network at a coding state: increasing the number of available meanings $n_m$, increasing the diversity parameter $w_D$, and decreasing the misreading r.

At the transition, the number of meaning islands $n_f$ jumps abruptly and becomes proportional to the number of symbols (Fig. 3C). $n_f$ is given by Euler's characteristic (Eq. 3) with the average edge occupancy $e = gs/2$ and vertex occupancy $v = h$. This implies that the number of meanings per symbol is $n_f/n_s = \chi/n_s + (p/2)e - (p/q)v$, where p is the coordination number in the symbol graph and q in the dual graph. The curves of the vertex and edge occupancies approach a common high-D limit, $e, v \to e_0 = a(bs)^q$, where the resulting meanings/symbol ratio is $n_f/n_s = p(1/2 - 1/q)e_0$. For planar regular graphs, $1/p + 1/q = 1/2$ and $n_f/n_s = e_0$. In Discussion, we examine possible effects of population dynamics on the evolution of the code.
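The meanings-per-symbol ratio follows from a short counting argument, sketched here from the definitions in the text. It relies on two standard facts about an embedded graph and its dual: both have the same number of edges, and the dual vertices are the faces of the symbol graph.

```latex
\begin{aligned}
\#\text{edges} &= \tfrac{1}{2}\,p\,n_s, \qquad
\#\text{dual vertices} = \frac{2\,\#\text{edges}}{q} = \frac{p}{q}\,n_s, \\
n_e &= e\cdot\tfrac{1}{2}\,p\,n_s, \qquad n_v = v\cdot\frac{p}{q}\,n_s, \\
\frac{n_f}{n_s} &= \frac{\chi - n_v + n_e}{n_s}
  = \frac{\chi}{n_s} + \frac{p}{2}\,e - \frac{p}{q}\,v .
\end{aligned}
```

Setting $e = v = e_0$ and dropping the term $\chi/n_s$, which vanishes for a large graph, gives $n_f/n_s = p(1/2 - 1/q)e_0$. For the hexagonal symbol graph of Fig. 1, $p = 3$ and $q = 6$, so the prefactor equals one.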

Discussion

The mean-field solution allows us to draw an approximate fitness landscape where the evolution of the code takes place. In general, this code fitness landscape $F_S(s_{ij})$ (see Eq. 6) resides in a high-dimensional code space, whose coordinates are the average spins $s_{ij}$ or their conjugates, the average edge occupancies $e_{ij}$. Each point in this space is a vector $\mathbf{s} = (s_{ij}, \ldots)$ of dimension equal to the number of edges, which represents a possible code. Symmetry may reduce this dimensionality, and for the regular symbol graph it becomes one-dimensional, $f(s)$ (Fig. 3). We imagine a population of “organisms,” simple information-processing systems, which compete according to the fitness of their codes. Each organism is depicted as a point in code space positioned at its code $\mathbf{s}$, and the population is described by a probability density $\Psi(\mathbf{s})$. The preceding discussion assumed implicitly that, as the control parameters change, the evolution of the code more or less follows the track of the optimum in code space. In other words, the population density is a delta-distribution located at the optimal $F_S$. We conclude this work by considering several more realistic scenarios. First, we discuss the possibility that the coding system is stuck at metastable suboptimal states. Then, we consider mutations and genetic drift that may broaden the population toward codes of less optimal fitness.

Metastable Suboptimal Codes.

Even when the global optimum in the code fitness is at a network state, $s_C \ne 0$, the no-network state, $s = 0$, may remain locally stable for some parameter range (Fig. 3B). To locate the metastable state, one examines the curvature of the free energy at its $s = 0$ extremum, which in the case of an isotropic symbol graph is $f'' = \partial^2 f/\partial s^2 \sim 1 - 2(q - 1)e^{-R}$. A metastable state exists as long as this curvature is positive. We find that the curvature changes its sign at the line $R = \ln(2q - 2)$, which corresponds to a “monomer” (a vertex plus an edge) of energy $\alpha + \beta = w_C \ln(2q - 2)$. This may be further clarified by considering the limit of the vertex/edge occupancy ratio near the no-network state, $v/e \to q/2$. Because there are $q/2$ edges per vertex, this indicates that the dominant building block of the dilute network is the monomer. Thus, the no-network state of a coding system becomes unstable exactly when the monomers become “thermally excitable,” compared to the $q$-junctions that are excitable at the coding transition.
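
As a quick numerical illustration of the metastability limit, the sketch below reads the curvature expression as $1 - 2(q - 1)e^{-R}$ (up to a positive prefactor), which is the form that changes sign at $R = \ln(2q - 2)$. The function names are hypothetical, and $q = 6$ is taken for the triangular dual of the hexagonal symbol graph.

```python
import math

def metastability_limit(q):
    """Scaled misreading R at which the s = 0 curvature vanishes."""
    return math.log(2 * q - 2)

def curvature_factor(R, q):
    """Sign-carrying factor of f'' at s = 0, read as 1 - 2(q - 1) exp(-R)."""
    return 1 - 2 * (q - 1) * math.exp(-R)

q = 6  # coordination number of the triangular dual lattice
R_limit = metastability_limit(q)
print(round(R_limit, 3))

# Positive factor: the no-network state is still locally stable.
print(curvature_factor(2.6, q) > 0)
# Negative factor: the no-network state has been destabilized.
print(curvature_factor(2.0, q) < 0)
```

For $q = 6$ the limit is $\ln 10 \approx 2.30$, which agrees with the lowest value of $R$ plotted in Fig. 3A.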

Effects of Mutations and Genetic Drift in the Code Space.

So far, it was assumed that the population of organisms is sharply concentrated around a certain optimal or metastable, suboptimal code. This scenario applies to large populations at negligible mutation rates μ. Let us consider two possible effects of population dynamics that may smear the population over the code space, mutations and genetic drift. These effects were analyzed in detail within the framework of rate-distortion theory (6) and are discussed here only schematically, in the context of the present polymer network model.

We consider a population that is localized around a fitness optimum $f_0$ at an optimal code $s_0$, where the landscape is approximately $f(s) \approx f_0 + \tfrac{1}{2}f''(s - s_0)^2$ [a regular graph with a one-dimensional fitness $f(s)$ is assumed for simplicity]. Mutations drive the organisms to diffuse into somewhat less optimal regions of the code space. This effect may be described in terms of reaction-diffusion dynamics of the probability distribution (6), $\partial_t\Psi = \mu\,\partial_s^2\Psi - f(s)\Psi$, where the first term is due to mutations and the second represents reproduction at a rate $-f(s)$. It is straightforward to find that this dynamics tends to a steady state, in which the population is broadened into a Gaussian, $\Psi(s) \sim \exp[-(f''/2\mu)^{1/2}(s - s_0)^2]$, of width that scales like $\sim(\mu/f'')^{1/4}$ (6). It implies that the width of a population at a metastable state ($s = 0$) will diverge near the metastability limit ($f'' = 0$) just before the Gaussian migrates to the coding state ($s_C \ne 0$).
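
The quoted width can be recovered, up to numerical prefactors, from a simple scaling balance. As a rough sketch, assume a steady profile of width $\sigma$ around $s_0$ (the symbol $\sigma$ is introduced here only for illustration). The mutation term then scales as $\mu\Psi/\sigma^2$ and the excess selection term as $f''\sigma^2\Psi$, so that

$$
\frac{\mu}{\sigma^{2}} \sim f''\,\sigma^{2}
\quad\Longrightarrow\quad
\sigma \sim \left(\frac{\mu}{f''}\right)^{1/4}.
$$

In this estimate $\sigma$ grows without bound as $f'' \to 0$, which is the divergence near the metastability limit described above.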

When the effective population size N is relatively small, fluctuations in the reproduction rate, termed genetic drift, become significant. In this regime, the dynamics is characterized by long periods when the population is localized around a fitness optimum, which are separated by fast diffusive migrations to new optima (6). The dynamics of the distribution is known to reach a Boltzmann partition, Ψ(s) ∼ exp(−Nf(s)), where the population size plays the role of an inverse temperature. Relatively small populations (which are “hot” in this sense) are expected to be partitioned by genetic drift between the available fitness optima. For example, the two minima of the free energy f(s) (Fig. 3) will be populated according to their fitness values.
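
The role of $N$ as an inverse temperature can be illustrated with a two-optimum toy calculation. The sketch below is a deliberate simplification: each optimum is treated as a single point (basin widths are ignored), and the fitness values and population sizes are hypothetical numbers chosen only for the example.

```python
import math

def boltzmann_partition(f_values, N):
    """Normalized weights exp(-N f) over a list of fitness optima."""
    f_min = min(f_values)  # subtract the minimum for numerical stability
    weights = [math.exp(-N * (f - f_min)) for f in f_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical optima: no-network state at f = 0, network state slightly lower.
f_optima = [0.0, -0.01]

small_population = boltzmann_partition(f_optima, N=10)    # "hot"
large_population = boltzmann_partition(f_optima, N=1000)  # "cold"
print(small_population)
print(large_population)
```

With these numbers the small population is split almost evenly between the two optima, whereas the large population sits almost entirely at the lower one.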

In the previous sections, we have shown that each organism experiences “internal” noise due to stochastic molecular recognition in its coding system. The internal noise affects the fitness of the code through the error load and the cost (16). On top of this, mutations and genetic drift add “external” sources of noise, which may drive parts of the population away from the optimal code. The existence of metastable states may further delay the transition to a coding state. The present model and its conclusions suggest that the n = 0 polymer network formalism is a potential tool to study several other aspects of noisy coding systems.

Methods

Cost of a Code-Table.

The cost of a code-table is traditionally measured by the mutual information $I$ between the symbols and the meanings that they encode, $I = S_{ME} + S_{SY} - S_{MS}$, where $S_{ME}$ and $S_{SY}$ are the entropies of the meaning space and symbol space, respectively, and $S_{MS}$ is the joint entropy of these two spaces. The entropy of meanings $S_{ME}$ is determined by their given distribution $P_\alpha$, $S_{ME} = -\sum_\alpha P_\alpha \ln P_\alpha$, and similarly, $S_{SY} = -\sum_i P_i \ln P_i = \ln(n_s)$. These are constant terms in the cost $I$ that we can neglect, and consider only the joint entropy $S_{MS}$, which can be optimized by tuning the average partition pattern $e_{ij}$. $S_{MS}$ is simply the entropy of all of the possible coloring patterns as determined by all possible networks and the number of possible ways to color every such pattern. The cost is therefore minus the coloring entropy, $I = -S_{MS} = -S_C$.
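
To make the cost concrete, the following sketch evaluates the mutual information as meaning entropy plus symbol entropy minus joint entropy, $I = S_{ME} + S_{SY} - S_{MS}$, for a small joint distribution. The helper names and the two toy code-tables (four symbols, two meanings) are assumptions made for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I = S_ME + S_SY - S_MS for joint[i][alpha] = P(symbol i, meaning alpha)."""
    p_symbol = [sum(row) for row in joint]
    p_meaning = [sum(col) for col in zip(*joint)]
    s_sy = entropy(p_symbol)
    s_me = entropy(p_meaning)
    s_ms = entropy(p for row in joint for p in row)
    return s_me + s_sy - s_ms

# Noiseless toy code: symbols 0,1 -> meaning 0 and symbols 2,3 -> meaning 1.
sharp_code = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
# Fully scrambled toy code: every symbol is read as either meaning equally.
scrambled_code = [[0.125, 0.125] for _ in range(4)]

I_sharp = mutual_information(sharp_code)
I_scrambled = mutual_information(scrambled_code)
print(I_sharp, I_scrambled)
```

The noiseless table carries $\ln 2$ nats per symbol, while the scrambled table carries none, so a more specific code is the more costly one in this measure.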

Graph Embedding and the Dual Graph.

The embedded graph divides the surface into faces or cells, hexagons in our example (Fig. 1). Then, one finds the dual graph by the following correspondence (10): Every vertex in the symbol graph corresponds to a cell in the dual (a triangle in this example) whereas every cell in the symbol graph (a hexagon) corresponds to a vertex of the dual. The correspondence between the edges in the symbol graph and its dual is one-to-one; every edge corresponds to the edge that crosses it in the dual. The resulting dual graph is a triangular lattice. The hexagonal lattice is a regular graph in which all of the vertices have the same coordination number. However, the embedding procedure described here applies to any connected graph whether it is regular or not.
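
The correspondence can be sketched in code for a small closed surface. The example below uses a cube (embedded on a sphere) as a stand-in for the symbol graph, since it is finite and every edge borders exactly two cells; the function name and the face list are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def dual_edges(faces):
    """Return the dual edges of a graph embedded in a closed surface.

    `faces` lists each cell as a cycle of vertex labels. Every cell becomes a
    dual vertex (its index in `faces`), and every edge shared by two cells
    becomes the dual edge that joins them.
    """
    cells_by_edge = defaultdict(list)
    for cell, cycle in enumerate(faces):
        # Pair each vertex with its successor around the cycle.
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            cells_by_edge[frozenset((a, b))].append(cell)
    if any(len(cells) != 2 for cells in cells_by_edge.values()):
        raise ValueError("each edge must border exactly two cells")
    return [tuple(cells) for cells in cells_by_edge.values()]

# Cube: vertices 0-3 on the bottom, 4-7 on top, vertex i + 4 above vertex i.
cube_faces = [
    [0, 1, 2, 3], [4, 5, 6, 7],
    [0, 1, 5, 4], [1, 2, 6, 5], [2, 3, 7, 6], [3, 0, 4, 7],
]
cube_dual = dual_edges(cube_faces)

dual_degree = defaultdict(int)
for a, b in cube_dual:
    dual_degree[a] += 1
    dual_degree[b] += 1
print(len(cube_dual), sorted(dual_degree.values()))
```

The 12 edges of the cube map one-to-one onto 12 dual edges, and each of the 6 cells becomes a dual vertex of coordination number 4, which is the octahedron.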

Mean-Field Approximation.

The spin probability distribution decouples into a product of independent single-spin distributions, $\rho = \prod_{ij}\rho_{ij}(S_{ij})$. We use a variational inequality, which sets an upper limit on the spin free energy, $F_S \le F_M = \langle\rho H_S\rangle + T\langle\rho \ln \rho\rangle$, where $\rho$ satisfies probability conservation, $\langle\rho\rangle = 1$. We augment $F_M$ with a Lagrange multiplier to account for probability conservation, $L = F_M + \eta\langle\rho\rangle$, and take the derivative $\delta L/\delta\rho_{ij} = 0$. The resulting distributions are $\rho_{ij}(S_{ij}) = \exp(g_{ij}S_{ij})/\langle\exp(g_{ij}S_{ij})\rangle$, where the effective fields are $g_{ij} = \partial H_S/\partial S_{ij} = \partial(h_i + h_j)/\partial S_{ij} = a_i b_{ij}[\prod_{k(i)\ne j}(1 + b_{ik}s_{ik}) + \prod_{k(j)\ne i}(1 + b_{jk}s_{jk}) - 2]$ with $s_{ij} = \langle\rho S_{ij}\rangle = \langle S_{ij}\rho_{ij}(S_{ij})\rangle$. From the $n = 0$ property, it follows that $\langle\exp(g_{ij}S_{ij})\rangle = \sum_k g_{ij}^k\langle S_{ij}^k\rangle/k! = 1 + g_{ij}^2/2$ and $\langle S_{ij}\exp(g_{ij}S_{ij})\rangle = g_{ij}$. This leads to the self-consistency relations, $s_{ij} = \langle S_{ij}\rho_{ij}(S_{ij})\rangle = \langle S_{ij}\exp(g_{ij}S_{ij})\rangle/\langle\exp(g_{ij}S_{ij})\rangle = g_{ij}(1 + g_{ij}^2/2)^{-1}$ (Eq. 7). In a similar fashion, we obtain the mean-field approximation for the free energy, $F_S \approx F_M = -\sum_i h_i + \sum_{ij} g_{ij}s_{ij} - \sum_{ij}\ln(1 + g_{ij}^2/2)$, and it is easy to verify that Eq. 7 defines the extremum $\partial F_S/\partial s_{ij} = 0$. Eq. 7 is analogous to the self-consistency relation of an Ising magnet, $s = \sinh(gs)/\cosh(gs)$; the different form is due to the truncation of the power-series expansion of the hyperbolic functions thanks to the $n = 0$ property. Solving Eq. 7 for $g_{ij}$ as a function of $s_{ij}$, we find that the solution can be described graphically as the points where the function $e_{ij} = g_{ij}s_{ij}/2$ crosses the ellipse $2s_{ij}^2 + (2e_{ij} - 1)^2 = 1$ (Fig. 3A).
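
The graphical construction can be verified numerically. The sketch below reads Eq. 7 as $s = g/(1 + g^2/2)$ for a single edge, sets $e = gs/2$, and checks that every such pair lies on the ellipse $2s^2 + (2e - 1)^2 = 1$. The function names and the sampled field values are illustrative assumptions.

```python
def mean_spin(g):
    """Self-consistency relation (Eq. 7) for one edge: s = g / (1 + g^2 / 2)."""
    return g / (1 + g * g / 2)

def edge_occupancy(g):
    """Edge occupancy e = g s / 2 at effective field g."""
    return g * mean_spin(g) / 2

def ellipse_residual(g):
    """Value of 2 s^2 + (2 e - 1)^2 - 1, expected to vanish for every g."""
    s = mean_spin(g)
    e = edge_occupancy(g)
    return 2 * s ** 2 + (2 * e - 1) ** 2 - 1

sample_fields = [0.0, 0.5, 1.0, 2.0, 5.0]  # arbitrary test values of g
residuals = [ellipse_residual(g) for g in sample_fields]
print(residuals)
```

The residuals vanish to rounding error, which reflects the algebraic identity $2g^2 + (g^2/2 - 1)^2 = (1 + g^2/2)^2$.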

Use of Euler's Characteristic.

When we apply Eq. 3, an underlying assumption is that the embedding is cellular (10); i.e., every meaning island is homeomorphic to an open disk. This is not necessarily true when the number of islands is less than the number of holes in the surface, that is, the genus $\gamma = 1 - \chi/2$. In this case, some islands are expected to wrap between two holes and therefore are not homeomorphic to a disk. However, in the “thermodynamic limit” of many islands per hole, $n_f \gg |\chi|$, the embedding is mostly cellular and is well approximated by Eq. 3.

Footnotes

The author declares no conflict of interest.

This article is a PNAS Direct Submission.

See Commentary on page 8165.

References

  • 1. Crick FH. The origin of the genetic code. J Mol Biol. 1968;38:367–379. doi: 10.1016/0022-2836(68)90392-6.
  • 2. Tlusty T. A model for the emergence of the genetic code as a transition in a noisy information channel. J Theor Biol. 2007;249:331–342. doi: 10.1016/j.jtbi.2007.07.029.
  • 3. Voituriez R, Joanny JF, Prost J. Generic phase diagram of active polar films. Phys Rev Lett. 2006;96:028102. doi: 10.1103/PhysRevLett.96.028102.
  • 4. De R, Zemel A, Safran SA. Dynamics of cell orientation. Nat Phys. 2007;3:655–659.
  • 5. Shannon CE, Weaver W. The Mathematical Theory of Communication. Urbana: Univ of Illinois Press; 1949.
  • 6. Tlusty T. Rate-distortion scenario for the emergence and evolution of noisy molecular codes. Phys Rev Lett. 2008;100:048101–048104. doi: 10.1103/PhysRevLett.100.048101.
  • 7. Rose K, Gurewitz E, Fox GC. Statistical-mechanics and phase-transitions in clustering. Phys Rev Lett. 1990;65:945–948. doi: 10.1103/PhysRevLett.65.945.
  • 8. Tlusty T. A relation between the multiplicity of the second eigenvalue of a graph Laplacian, Courant's nodal line theorem and the substantial dimension of tight polyhedral surfaces. Elec J Linear Algebra. 2007;16:315–324.
  • 9. Berger T. Rate Distortion Theory. Englewood Cliffs, NJ: Prentice–Hall; 1971.
  • 10. Gross JL, Tucker TW. Topological Graph Theory. New York: Wiley; 1987.
  • 11. de Gennes PG. Exponents for the excluded volume problem as derived by the Wilson method. Phys Lett A. 1972;38:339–340.
  • 12. de Gennes PG. Scaling Concepts in Polymer Physics. Ithaca, NY: Cornell Univ Press; 1979.
  • 13. Zilman AG, Safran SA. Thermodynamics and structure of self-assembled networks. Phys Rev E. 2002;66:051107. doi: 10.1103/PhysRevE.66.051107.
  • 14. Wang ZG, Costas ME, Gelbart WM. Flexible micelles and the n → 0 vector spin model. J Phys Chem. 1993;97:1237–1242.
  • 15. Wheeler JC, Pfeuty P. The n → 0 vector model and equilibrium polymerization. Phys Rev A. 1981;24:1050.
  • 16. Tlusty T. A simple model for the evolution of molecular codes driven by the interplay of accuracy, diversity and cost. Phys Biol. 2008;5:016001. doi: 10.1088/1478-3975/5/1/016001.
