Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2017 Apr 3.
Published in final edited form as: Bull Math Biol. 2012 Nov 14;74(12):2897–2916. doi: 10.1007/s11538-012-9788-z

An exactly solvable model of random site-specific recombinations

Yi Wei 1, Alexei A Koulakov 1
PMCID: PMC5378498  NIHMSID: NIHMS852674  PMID: 23151958

Abstract

Cre-lox and other systems are used as genetic tools to control site-specific recombination (SSR) events in genomic DNA. If multiple recombination sites are organized in a compact cluster within the same genome, a series of random recombination events may generate substantial cell specific genomic diversity. This diversity is used, for example, to distinguish neurons in the brain of the same multicellular mosaic organism, within the brainbow approach to neuronal connectome. In this paper we study an exactly solvable statistical model for SSR operating on a cluster of recombination sites. We consider two types of recombination events: inversions and excisions. Both of these events are available in the Cre-lox system. We derive three properties of the sequences generated by multiple recombination events. First, we describe the set of sequences that can in principle be generated by multiple inversions operating on the given initial sequence. We call this description the ergodicity theorem. On the basis of this description we calculate the number of sequences that can be generated from an initial sequence. This number of sequences is experimentally testable. Second, we demonstrate that after a large number of random inversions every sequence that can be generated is generated with equal probability. Lastly, we derive the equations for the probability to find a sequence as a function of time in the limit when excisions are much less frequent than inversions, such as in shufflon sequences.

Keywords: Site-specific recombination (SSR), Markov process, ergodicity

1. Introduction

Site-specific DNA recombination is a useful mechanism that can generate genomic diversity. Mosaic animals, that carry distinguishable genomes in different cells of the same organism, have become important tools in studying its organization. For example, in brainbow mice, neurons carry randomly colored fluorescent dies, which allow to distinguish neighboring cells with the purpose to establish their connectivity (13). Next generation sequencing methods allow to trace stem cell lineage by identifying progeny with similar genetic barcodes (4). To implement this strategy, stem cells have to carry distinguishable genetic sequences. Site-specific DNA recombination could be used to generate such a sequence diversity. Finally, mosaic sequence diversity may help study connections between neurons using the next generation sequencing technologies (5, 6). Here we study the statistical model for SSR events with the purpose of understanding both the diversity of their products and their probability distributions.

Cre-lox system can be used as a tool to both control gene activation (79) and generate randomly diverse DNA sequences (13). A loxP site (locus of X-over P1) is a 34bp sequence segment consisting of two 13bp inverted complementary repeats separated by an oriented 8 bp asymmetric region (ATAACTTCGTATA – GCATACAT - TATACGAAGTTAT) (9). The orientation of loxP recombination site is determined by the central 8bp region. The Cre recombinase is an enzyme that mediates SSR at loxP sites. Such recombination events occur between two loxP sites and the outcome depends on the relative orientation of the loxP sites. When two loxP sites that have opposite orientation are recombined, an inversion occurs. After an inversion, the DNA segment between two loxPs is flipped: its orientation is inverted and the sequence is replaced with the complement (Figure 1) (10). The remaining parts of the original sequence which are not involved in the inversion are left unchanged. The recombination between two loxP sites of the same orientation yields an excision, whereby one of the loxPs and the whole segment between the two loxPs are removed from the sequence (Figure 2). Other variants of loxP have been identified, such as lox2272 and loxN, that can mediate Cre-based recombination to produce multicolored mouse brains with several fluorescent proteins in brainbow mice (13).

Figure 1.

Figure 1

Two types of inversions. Inversion (A) is between an R (right) and an L (left) SSR-sites. Inversion (B) is between an L and an R SSR-sites.

Figure 2.

Figure 2

Two examples of excisions. Excision (A) is between two R SSR-sites and excision (B) is between two L SSR-sites.

In other systems, such as rci-R64 recombination, the excision events are not reported (11). Instead, the recombination leads to an inversion. These systems are therefore called shufflons (11). For the most part, our study will be relevant to this type of system, because we will assume that excision events are either non-existent (Section 2 and 3) or rare (Section 4). In our study site specific recombination sites are called SSR-sites. Our model may apply to Cre-lox and other systems, such as rci-R64, Cre-lox2272/loxN, and others (13, 11).

We make several assumptions about the random recombination processes. First, we will assume that inversions and excisions can be described by independent Markov processes with probabilities that do not depend on time. Second, we will assume that when an event (inversion or excision) happens to a DNA sequence, it happens to all the SSR-sites, which satisfy the conditions for the corresponding recombination, independently with equal probability.

In this paper we answer the following questions regarding the inversion process:

Q1. Given an arbitrary initial DNA sequence, what are the sequences that can appear after applying an arbitrary number of inversions? What is the number of sequences that can be generated by such a process? These questions are addressed in Section 2 below.

Q2. Is there a unique equilibrium distribution of the DNA sequence configurations after a sufficiently large number of random inversions? What is the probability of observing one particular DNA sequence when the equilibrium is reached after a large number of inversions? This question is addressed in Section 3.

Combining both inversion and excision processes, we will answer to the following question.

Q3. Assume that excisions are much slower than inversion, so that excisions occur after an equilibrium due to inversions has been reached. Given an arbitrary initial DNA sequence at time T=0, what is the probability of observing a specific DNA sequence configuration at time T=t? This question is addressed in Section 4 of this study.

2. The diversity of the results of multiple inversions

In this section we will derive the sequences that can be obtained from an initial DNA sequence after a number of inversions. We will study the cluster of SSR-sites of different orientations located on the same DNA segment in close proximity so that inversions are possible between any pair of appropriate orientation. We will assume that inter SSR-sites DNA segments carry distinguishable sequences. We will enumerate the possible sequences that can be produced by multiple inversion events, given this initial sequence.

Before we rigorously derive our result, we briefly present our main finding of this Section. Given an initial sequence (Figure 3A), all of the sequences that can be generated by inversions can be obtained as follows. First, one cuts the sequence into segments containing nearest pairs of SSR-sites and inter-SSR-sites DNA sequences. We call such segments units. Second, one builds a dictionary of units that contains both direct and inverted sequences (Figure 3B). These units can be recombined into new sequences (Figure 3C) by satisfying the following two rules. The first rule is that each unit from the original sequence should be used once and only once. The second rule postulates that the orientations of SSR-sites have to agree between the edges of neighboring units (Figure 3C). In this Section we show that any sequence that follows these two simple rules can be obtained by inversions from the original sequence. This means that inversions can generate any sequence within a cluster of SSR-sites that does not generate new units. We call this property of the inversion process, proven in Theorem 2, the ergodicity property.

Figure 3.

Figure 3

Ergodicity of inversions. To enumerate all sequences that can be obtained from an original sequence (A) by inversions, one first generates the dictionary of units (B). The units include direct and inverted segments located between two neighboring SSR-sites. Units also include the information about the direction of SSR-sites (gray triangles). The units can be recombined into new sequences that include each unit only once and respect the orientation of SSR-sites (C). All of the sequences that satisfy these constraints can be produced by inversions from the original sequence (A). This ergodicity property of inversions is proven here in Theorem 2. The total number of such sequences is given by Eq. (2.2) and is denoted by ZM,N.

In a DNA sequence with SSR-sites, every two adjacent SSR-sites and the DNA segment between them form a unit. Two adjacent units share one common SSR-site. A DNA sequence can be viewed as a chain of such units. It is convenient to classify units according to the orientations of their SSR-sites. For example, an RR unit and an RL unit are shown in Figure 4.

Figure 4.

Figure 4

Examples of an RR unit and an RL unit.

In order to find all possible configurations resulting from inversions, it is important to note some properties of the inversion operation of DNA sequences.

Property 1

An RR unit is transformed into an LL unit if it is inside the part of DNA sequence that has been inverted (Figure 5). Otherwise it remains unchanged. Similarly, an LL unit will either be transformed into an RR unit or remain unchanged. An RL (LR) unit remains an RL (LR), but the DNA sequence between SSR-sites is either unchanged, if the unit is not inverted, or changed to reverse complement as a result of inversion.

Figure 5.

Figure 5

An RR unit transforms to an LL unit after an inversion.

Property 2

The orientations of SSR-sites on both ends of the whole DNA sequence remain invariant under all inversions and excisions.

Property 3

RL units and LR units are distributed alternatively along a sequence. An RL unit is separated by RR units (or nothing) from its left neighboring LR unit and by LL units (or nothing) from its right neighboring LR unit.

These observations lead to the following definition.

Definition 1

The rank N of a DNA sequence is the total number of its RL and LR units, i.e. N=NRL+ NLR. By Property 1, N is invariant with respect to inversions.

The configuration of a DNA sequence, including the orders and the orientations of the SSR-sites and DNA segments, can be decomposed into the SSR-site part and the inter-SSR-site segment part. The SSR-site part defines a SSR-site array, see Figure 6. It is convenient to work first in the SSR-site space, defined as the set of all possible SSR-site configurations of a given DNA sequence, and to restore the segment part in the final step. For SSR-site array, a unit is a pair of adjacent SSR-sites (which is just a unit in the DNA sequence without the DNA segment).

Figure 6.

Figure 6

A configuration of a DNA sequence can be represented as a ‘direct sum’ of its SSR-site array and inter-SSR-site segment array.

Let us assume the DNA sequences start with an R SSR-site.

Definition 2

A SSR-site array is called in the canonical form if it has no LL units and all the RR units are located on the left part of the sequence (Figure 7).

Figure 7.

Figure 7

A SSR-site array in the canonical form, beginning with 3 RR units followed by alternating RL and LR units.

Theorem 1

Any SSR-site array can be transformed into the canonical form with a finite number of inversions.

Proof

For notational convenience, we assign fixed symbols to certain SSR-sites in an array. In the process of inversions, a SSR-site may change its orientation as well as position but its symbol is never changed and always attached to it. Denote an inversion between SSR-site a and b by I=(a, b). The composition of two consecutive inversions, first between a and b and then between c and d, is written as I=(a, b) ○ (c, d). The same notation is adopted for multiple inversions.

Next we define two types of procedures, as shown in Figure 8. Note that after each procedure, the first two SSR-sites are changed from RL to RR.

  1. As already mentioned, we always assume an array starts with an R SSR-site. If the array is partially in the canonical form, i.e. it begins with a continuous series of RR’s and then a series of alternating RL and LR until an occurrence of an LL, perform a type-A inversion between this LL and the first RL, as shown in Figure 8A.

  2. If the sequence is partially in the canonical form until an occurrence of an RR to the right of the canonical part, perform a type-B inversion between this RR and the first RL, as shown in Figure 8B.

Figure 8.

Figure 8

(A). Type-A inversion. (B). Type-B inversion.

The result of each of these two operations is that the subsequence in the canonical form is increased by 1 unit. Repeat step 1–2, until there is no more LL or RR unit to the right of any L SSR-site, i.e. the whole sequence is in the canonical form. □

By Theorem 1, if a set of SSR-site arrays can be transformed into the same canonical form, they can be transformed into each other with a finite number of inversions.

Note: Without loss of generality, we will only study DNA sequences starting with an R and ending with an L SSR-site in remaining part of this paper. Hence, all formulae in the remaining of this paper are for DNA sequences of this particular form. Results for sequences ending with an R SSR-site can be derived in the same way and are essentially the same.

Definition 3

If a SSR-site array A can be transformed into another array B by a finite number of inversions, we say A and B are equivalent, denoted by A ~ B.

Clearly, if A ~ B and B ~ C, then A ~ C.

By Property 3, the number of RL and LR units of an array are both invariant with respect to inversions. In fact, it is easy to show that a rank-N SSR-site array has [N+12] number of RL and [N2] number of LR, where [x] = max(n ≤ x | n is integer). Let (M, N) denote the set of rank-N SSR-site arrays with M number of RR and LL units, i.e. M = #RR + #LL. By Theorem 1, all arrays in (M, N) can be transformed into the same canonical form, which is just a special element in the set which has M number of RR units. Therefore any two arrays in (M, N) can be transformed into each other by a finite number of inversions, i.e. arrays in (M, N) are equivalent to each other. This observation leads to the following conclusion.

Property 4

The set (M, N) of SSR-site arrays forms an equivalence class.

Definition 4

Let F be a function defined on SSR-site arrays with fixed length. If the value of F is determined only by the rank of sequences, F is called a class function. A class function takes a constant value in an equivalence class (M, N).

By Property 3 and Definition 1, arrays with different ranks are not equivalent. Let S be the set of SSR-site arrays with N total units. S is decomposed into a direct sum of N number of equivalence classes with different ranks, i.e. S=(N-1,1) ⊕…⊕ (0,N). Under all possible inversions, a SSR-site array will span the whole equivalence class it belongs to.

We are now ready to restore the DNA segments and answer Q1 based on the properties of SSR-site arrays. First, we make the following observation.

Property 5

If the SSR-site array of a DNA sequence is in the canonical form, all permutations of its RR units can be achieved by inversions, while keeping the array in the canonical form and configurations of RL and LR units unchanged. Similarly, all permutations of the RL (LR) units can be achieved without affecting the RR and LR (RL) parts.

Proof

  1. We first show that an arbitrary pair of segments of neighboring RR units can be exchanged while leaving the remaining array unchanged. It is sufficient to elucidate the procedure with an example shown in Figure 9. It is straightforward to check that the operation I1 = (a,d) ○ (b,d) ○ (a,c) ○ (c,d) exchanges segments A and B and leaves all other units unchanged. Since an arbitrary permutation of segments in RR units can be generated by a series of pair exchanges (because very element of symmetry group SN can be written as a product of exchanges), we complete the proof of the first claim.

  2. Similarly, in Figure 9, I2 = (d,g) ○ (d,e) ○ (d,f) ○ (d,g) exchanges segments D and F and leaves all other units unchanged. By the same argument above, we conclude that when the SSR-site array is in the canonical form, we can get, by inversions, all permutations of segments of LR units without modifying the remaining of the sequence. In the same way we can prove the same argument holds for RL units. □

Figure 9.

Figure 9

An example of exchanging two RR units without modifying all others units in the sequence.

Using Theorem 1 and Property 1–5, we now summarize this section by answering Q1 with the following theorem.

Theorem 2

(Ergodicity) Let the SSR-site configuration of the initial DNA sequence be a state in (M, N). Any DNA sequence satisfying the following three conditions can be reached by applying a finite number of inversions on the initial sequence. C1. Its SSR-site configuration is a state in the same class (M, N); C2. Its RL (LR) units, or their inversions, are the RL (LR) units of the initial sequence; C3. Its RR (LL) units are the RR (LL) units, or LL (RR) units with inverted inter-SSR-site segments, of the initial sequence.

The total number of possible SSR-site arrays in class (M,N) is given by

d(M,N)=CM+NN=(N+M)!N!M!. (2.1)

For a given initial sequence, assuming that inter-SSR-site sequences are distinguishable and non-symmetric, the total number of unique configurations that can be obtained by inversions is

ZM,N=2NM![N+12]+![N2]!d(M,N). (2.2)

Where, as before, we used the notation [x] = max(n ≤ x | n is integer). According to Theorem 2, all of these sequences can be reached by a finite set of inversions from any initial sequence. The number of configurations ZM,N can be measured using DNA sequencing and is therefore an experimentally testable prediction.

3. Probabilities and statistics associated with random inversions

In this section we study the statistical properties of SSRs under the random inversions. By our initial assumption, all possible inversions of any sequence are independent and happen with the same probability. Under this assumption, here we show that after a sufficiently large number of inversions, the configurations reach a unique equilibrium distribution. We show that in this distribution all possible configurations described in the previous chapter are equally likely (Theorem 3).

We first define the random process that will be used in this chapter. We will denote the probability to observe a configuration number i at time t by the vector Ψi(t). The configuration may be defined either by the array of SSR-sites of or by both SSR-sites and intermediate DNA sequences. We will derive general properties of the random process first without making the definition of the configuration more concrete. Because all inversions occur with the same probability, we can relate two vectors at two near time points separated by a short interval Δt:

Ψi(t+Δt)=rinvΔtjR^ijΨj(t)+Ψi(t)[1-rinvΔtkR^ki] (3.1)

In this expression rinv is the rate with which a single inversion occurs, R^ij is the number of inversion that connects states i and j that is usually either 0 or 1, and the last term is needed to ensure the conservation of probability, i.e. that iΨi(t)=1 for any t. By differentiating this expression with respect to Δt and setting it to 0, we obtain

dΨi(t)dt=jrinv(R^ij-Ciδij)Ψj(t). (3.2)

Here Ci=jR^ij. This equation has the following solution

Ψ(t)=exp(rinv[R^-C^]t)Ψ(0), (3.3)

Where the elements of the diagonal matrix are given by Cij = Ciδij. Using this equation we will study the equilibrium distribution of sequences at the end of a large number of rotations.

Equation (3.3) describes a continuous-time Markov process with the transition probability matrix T^exp(rinv[R^-C^]t). The equilibrium distribution of this process can be obtained as a limit Ψ=limtexp(rinv[R^-C^]t)Ψ(0). Because of the ergodicity theorem 2 and the fact that T^ is non-negative, this distribution is unique and does not depend on the initial state Ψ(0). Clearly, the elements of Ψk are non-zero only for a subset of states that are reachable from the initial configuration, i.e. for ZM,N of such states [Eq. (2.2)]. Because the transition matrix T^ is symmetric, the non-zero elements of Ψk are the same and equal to 1/ZM,N (12). This follows from the detailed balance condition pertinent to symmetric Markov processes in the equilibrium, i.e. TknΨn = TnkΨk (12). We therefore arrive to the following theorem:

Theorem 3

There exists a unique vector Ψ, such that for an arbitrary initial state Ψ(0), Ψ(0),limtT^(t)Ψ(0)=Ψ. All non-zero components of vector Ψ are the same. This theorem means that all of the states that can be reached from an arbitrary initial state are represented in the equilibrium with equal probability. Interestingly, this result is valid for both configurations that include SSR-sites only and the complete sequences.

To give an example of the transition matrices we will consider the sequence defined by SSR-site configuration only (Figure 10). The possible states of recombination sites only (Figure 10, bottom) help define the transition matrix R^=(111121111). This matrix means that there are two inversions that leave state 2 invariant and there is one inversion for all other transitions. The rate of transitions is determined by the matrix R^-C^=(211121112) that clearly has an eigenvalue of λ=0 corresponding to the constant eigenvector Ψ = (1/3 1/3 1/3). This eigenvector represents the distribution at t → ∞. The other eigenvalue λ = −3 corresponds to the eigenvectors that decays over time.

Figure 10.

Figure 10

An example of SSR-site-only configuration. Top: The initial sequence. Bottom: Possible configurations of SSR-sites.

If the transition matrix is defined for the complete sequences, we anticipate that the elements of R^ can be either 0 or 1. This is a consequence of the inter SSR-sites DNA segment having unique sequences. The matrix rR^ defines the transition rates for the inversion processes. Our Proposition 1 relied on the symmetry of this matrix only. Thus, if inversion rates depend on the distance within the pair of SSR-sites, as long as the inversion process is reversible, Theorem 3 is expected to be valid. Thus, we suggest that the equal probability Theorem 3 is valid even if not all of the inversions are equally likely and transition probabilities are dependent upon the length of inter SSR-sites stretch.

4. Probabilities and statistics of inversions and excisions

In this section we consider both inversions and excisions. We will answer Q3 of the introduction, i.e. we will derive the probability of observing a sequence given the initial sequence as a function of time. We will assume that the excisions events are much rarer than the inversion events. We will assume that between two excisions, the sequence distribution reaches equilibrium due to inversions. This assumption allows us to find solutions for the probability of observing a sequence. We will find the answer by two different methods. First, we will use the summation over paths to determine probabilities of transitions. Second, we will employ the method based on matrix exponential. We will show that these methods give the same result.

First we will consider a simplified version of Q3 of the Introduction. We will only address the SSR-sites and will not include inter-SSR-site DNA segments. The probability distribution of the latter can be evaluated on the basis of equal probability argument (Theorem 3).

Q3′: Starting with a state in a class (M0, N0) at time T=0, what is the probability of finding the sequence in the class (M1, N1) at T=t?

First we will use the summation over paths method. To answer Q3′, we note that all possible (M1, N1) which are reachable from (M0, N0) must satisfy the conditions ΔM=M0-M1 ≥ 0 and ΔN=N0-N1 = 2k, where k ≥ 0. In fact, all classes (M1, N1) that satisfy these two conditions can be reached from certain states of (M0, N0) with a single (1-step) excision. Although it is not necessarily true that the class (M1, N1) can be reached by 1-step excision from every state of (M0, N0).

We define two quantities associated with the excision transitions between two classes. Let xi(M,N) be the total number of possible excisions of the i-th state of the class (M,N). Then we have xi(M,N)=W-Ci, where W=12(M+N)(M+N+1) and Ci is the connectivity of the state i of class (M, N), defined as Ci=(r+[N+12])(Mr+[N+12]). Here r is the number of RR units. Define x(M,N) as the average number of excisions of states in (M, N), then we have x(M,N)=W-1d(M,N)i(M,N)Ci, where d(M,N) is the total number of states in the class (M, N) [(Eq. (2.1)]. Define D(M,N)(r) as the number of states in a class (M, N) which has r number of RR units. It is straightforward to show that D(M,N)(r)=Cr+[N2]r×CM-r+[N12]M-r. Then we have

i(M,N)Ci=r=0MD(M,N)(r)C(r)={k2CM+2k+1M,N=2k1k(k1)CM+2kM,N=2k2.

Let yi(M0,N0)(M1,N1) be the total number of 1-step excisions leading to (M1, N1) from the i-th state of (M0, N0) and y¯(M0,N0)(M1,N1) be the average of yi(M0,N0)(M1,N1), i.e. y¯(M0,N0)(M1,N1)=1d(M0,N0)i(M0,N0)yi(M0,N0)(M1,N1). The sum is calculated by noticing the following facts. For a state μ ∈ (M1,N1), let μ¯ be the set of SSR-site array which, when inserted into μ as in Figure 11, make the combined arrays states of (M0, N0). There are two types of arrays in the set μ¯, see Figure 11. Type-I arrays begin and end with R SSR-site and correspond to the excisions shown in Figure 2A. By definition, type-I sequences belong to the class (ΔM,ΔN), where ΔM=M0-M1 and ΔN=N0-N1. Similarly, type-II arrays belong to the equivalence class (ΔM,ΔN)T, which begin and end with L SSR-sites, corresponding to the excisions shown in Figure 2B.

Figure 11.

Figure 11

Type-I arrays belong to the class (ΔM,ΔN). A type-I arrays can be inserted to each R SSR-site of a state in (M1, N1), and the outcome is an array (M0, N0). Type-II arrays belong to the class (ΔM,ΔN)T and should be inserted to L SSR-sites.

There is a 1-to-1 correspondence between the arrays of type-I and type-II by inversion. Let vI be an array of type-I, then the inversion of vI is an array of type-II and is denoted by vIT. From Figure 11 caption we find

y¯(M0,N0)(M1,N1)=1d(M0,N0)i(M0,N0)yi(M0,N0)(M1,N1)=(M1+N1+1)d(M1,N1)d(M0-M1,N0-N1)d(M0,N0). (4.1)

Note that although type-I and -II arrays are not the typical SSR-site arrays we have being working with, i.e. begin with an R site and end with an L site, the dimension formula [Eq. (2.1)] still applies. This is because they both can be transformed into corresponding canonical forms and then all similar derivations follow.

Let P(M0,N0)(M1,N1)k-step(t) be the probability of starting with a state in (M0, N0) at T=0 and finding the array in a state of (M1, N1) at T=t after k excisions. Denote the total probability by P(M0,N0)(M1,N1)(t), i.e P(M0,N0)(M1,N1)(t)=k=0kmaxP(M0,N0)(M1,N1)k-step(t), where kmax=ΔM+[ΔN2]. Assume that the rate of inversions rinv is much higher than the rate of excisions rexc, i.e. rinv ≫ rexc, so that inversions reach equilibrium between excisions. Then the probability to remain within class (M0, N0) is given by P(M0,N0)(M0,N0)(t)=exp(rexcx0t), where we used a shorthand notation X0=X(M0,N0). The probability of transition from (M0,N0) to (M1,N1) by time t with one excision is

P(M0,N0)(M1,N1)1-step(t)=0tdty01rexcexp[-rexcx0t-rexcx1(t-t)]=y01x0-x1[exp(-rexcx1t)-exp(-rexcx0t)] (4.2)

where y01=y¯(M0,N0)(M1,N1) and x1=x(M1,N1). Similarly, the k-step transition probability through an excision path σ: (M0,N0) → (M1,N1) → ⋯ →(Mk,Nk), can be computed as

Pσ(t)=0tdt0y0,1rexcexp(-rexcx0t0)0tdt1y1,2rexcexp[-rexcx1(t1-t0)]tk-2tdtk-1yk-1,krexcexp[-rexcxk1(tk-1-tk2)-rexcxk(t-tk-1)]=(i=0k-1yi,i+1)detM(t)Δ(x0,,xk) (4.3)

where Δ(x0,,xk)=0ijk(xj-xj) is the determinant of the Vandermonde matrix of x (13) and we defined

M(t)=(exp(-λexcx0t)exp(-λexcx1t)exp(-λexcxkt)111x0x1xkx0k-1x1k-1xkk-1) (4.4)

Note that in the derivation of Eq.(4.2)(4.3), we made the approximation by assuming that the array reaches the equilibrium of inversions right after each excision. This approximation is valid only when inversions happen very fast, i.e. rinv ≫ rexc or the number of excisions in [0, t] is small. This is the case by our assumption.

To find P(M0,N0)(M1,N1)k-step(t), we need to sum over all possible excision paths,

P(M0,N0)(Mk,Nk)k-step(t)=σPσ(t), (4.5)

where the sum is over all k-step transitions between (M0, N0) and (Mk, Nk).

To find P(M0,N0)(M1,N1)(t), we define two M[N/2]×M[N/2] dimensional matrices: 1. an upper triangular matrix Y=(Yij), i,j=1,…,M[N/2], with elements Yij=y¯(Mi,Ni)(Mj,Nj), for i<j, and Yij = 0, i ≥ j. 2. a diagonal matrix X^, with elements Xii=x(Mi,Ni), i=1,…,M[N/2]. According to the properties of x(M,N), the matrix X^ is such that Xii=jYij.

The following theorem answers Q3′:

Theorem 4

Let P(Mi,Ni)(Mj,Nj)(t) be the probability of starting from an array in (Mi,Ni) at T=0 and finding the array belongs to (Mj,Nj) at T=t. Then we have

P(Mi,Ni)(Mj,Nj)(t)=(exp[λexct(Y^-X^)])ij. (4.6)

Proof

Define a matrix P(t), with Pij(t)=P(Mi,Ni)(Mj,Nj)(t), i,j=1,…,M[N/2]. By definition, we have

Pij(t+dt)=Pij(t)(1-Xijrexcdt)+kPik(t)Ykjrexcdt. (4.7)

From which, we find P(t) satisfies the matrix equation

dP^(t)dt=rexcP^(Y^-X^). (4.8)

Solving this equation, we find P^(t)=exp[rexct(Y^-X^)]. □

To check that Theorem 4 agrees with Eq. (4.5), we collect terms in the Taylor expansion of Eq. (4.6) with k number of Y’s and find coefficients corresponding to each transition path σ equal to those in Eq. (4.4).

We can now restore the inter-SSR-site DNA segments. Let ℙ(t) be the probability of starting with a DNA sequence in (M0, N0) at T=0 and finding a particular DNA sequence in (M1, N1) at T=t. By the symmetry among RR and LL units stated in Theorem 3, to have the desired set of RR and LL units in the final sequence, we need to multiply P(M0,N0)(M1,N1)(t) by a factor 1/CM0M1. Similarly, we need to multiply P(M0,N0)(M1,N1)(t) by factors 1/C[N02+1][N12+1] and 1/C[N02][N12] to have the desired RL and LR units. And to have all the segments in the correct order and orientation, we need to multiply by the probability by 1/ZM1,N1 given by Eq. (2.2). Define the quantity Z^=CM0M1C[N02+1][N12+1]C[N02][N12]ZM1,N1, finally we get

P(t)=1Z^P(M0,N0)(M1,N1)(t). (4.9)

Theorem 4, Eq.(4.5) and Eq.(4.9) answer Q3.

5 Discussion

In this paper we derived the properties of random recombinations operating within a cluster of loxP or similar SSR-sites (13, 11). We addressed several questions pertaining to the properties of distributions of the resulting sequences. First, in Sections 2 and 3, we analyzed the processes of inversions only. We assumed that excisions between two recombination sites are nonexistent. This approximation is appropriate for the rci-R64 recombination system that is known as shufflon (11). Later, in Section 4, we included excisions into consideration assuming that they occur very infrequently.

We obtained two important results regarding the processes of inversion. First, we showed what DNA sequences are possible to obtain from an original sequence by an arbitrary set of inversions. The main conclusion is that any sequence that includes segments from the original sequence can be reached. Therefore we called this property the This property is defined ergodicity of the inversion process. more precisely in Theorem 2. On the basis of this property we derived the number of sequences that can be produced from any initial sequence by an arbitrary set of inversions. This number is given by equation (2.2) and is an experimentally testable prediction of our theory. This number of sequences can be studied using modern sequencing methods. In particular, assume that one initially has a large population of identical sequences of SSR-sites. Subsequent introduction of a restriction enzyme, such as rci or Cre, may lead to inversions that diversify the sequences. The number of resulting unique sequences can be counted by using sequencing methods and compared to Eq, (2.2). Below we present some examples of using this equation in simple cases.

Overall, our conclusions in Theorem 2 and 3 suggest that site specific DNA recombination can lead to diverse ensembles of sequences. Further, in Section 4, we derive an equation for the probability of obtaining sequences as a function of time when excisions are included, assuming that these processes are much slower than inversions.

Simple example

To illustrate our findings, let us consider the DNA sequence shown in Figure 10. Since the SSR-site array contains two RR or LL units and only one RL unit, this sequence belongs to the class (2,1), i.e. M=2, N=1. By Eq. (2.1), its SSR-site array has d(2,1)=CM+NN=C31=3 configurations, shown in Figure 10 above.

By Eq. (2.2), the total number of possible DNA sequences is ZM,N=2NM![N+12]![N2]!d(M,N)=12. Here and above the notation [x] means the largest integer smaller than or equal to x. We show all these 12 configurations explicitly in Figure 12.

Figure 12.

Figure 12

For the given initial DNA sequence |1〉, there are 12 DNA configurations which can be generated by applying inversions on inverted SSR-sites.

We also show (Theorem 3) that after a large number of inversions, when equilibrium distribution of sequences is reached, all of the sequences that can be obtained, are represented with equal probability. This implies that various possible sequences become equally likely after a large number of inversions. To illustrate this here, we simulated the random inversions, according to the assumptions of the Introduction, beginning from the initial configuration |1〉 (Figure 13A). We find that all 12 configurations listed in Figure 12 appear with the same probability (Figure 13B) as predicted by Theorem 3.

Figure 13.

Figure 13

(A) Simulation of the first 100 random inversions on initial DNA sequence |1〉 in Figure 12. Indices of vertical axis represent DNA sequences defined in Figure 12. The probability of an inversion to occur during one step is rinv = 0.1. (B) Comparison of frequencies of all 12 possible DNA configurations appearing in 100,000 sequential random inversions. The total number that |1〉 appears is normalized to 1.

Example of shufflon sequence in plasmid R64

Here we will give an example of applying our results to the bacterial plasmid R64 that contains shufflons (Figure 14). The sequence contains seven SSR-sites and six segments between them. Out of these six segments, only four contain coding regions: A, B, C, and D (Figure 14A). The remaining two segments, denoted by I1 and I2 are non-coding. Here we will calculate the number of combinations possible in the coding regions, using our equations.

Figure 14.

Figure 14

(A) The initial DNA sequence with shufflon segments A–D and SSR sites that are called sfx sequences 1–7 (11). (B) All possible DNA sequences that can be achieved by the constrained inversions on the initial sequence in (A). Each empty space in sequences in B is filled with one unit from (C), in all possible orders. The total number of possible sequences is 6·3!·8=288.

Because R64 contains M=1 LL or RR units and N=5 LR or RL units. The number of combinations of SSR-sites is given by d(1,5)=CM+NN=C65=6. All of these six SSR-site configurations are shown in Figure 14B. The total number of configurations is given by Eq. (2.2) that yields Z1,5 = 2304. This number of combinations includes the variance in the non-coding regions I1 and I2. If we are not to include permutations with non-coding regions, we would have to divide this number by 8, the total number of non-coding combinations. Thus get 288 permutations in the coding regions only.

In certain systems, only inversions between RL SSR-sites, but not LR sites are possible (14). This case is addressed in Appendix. In the appendix we show that the number of combinations in the constrained case is generally one half of that in the unconstrained case. We thus expect to have 1152 combinations that include both coding and non-coding units. To obtain the number of coding combinations only we have to divide this number by four possible in the constrained case configurations of the non-coding units (see Appendix). We thus obtain 288 combinations for coding sequences only. This number is the same for both unconstrained and constrained inversions. To confirm these numbers in Figures 14B and C we demonstrate how all of these 288 configurations can be obtained with RL inversions.

Acknowledgments

The authors thank Tony Zador for suggesting this problem to us and multiple valuable comments. The authors thank Dawen Cai, Jeff Lichtman and Teruya Komano for a helpful communications. This work was supported by NIH R01EY018068 and R01MH092928 and Swartz Foundation. AK acknowledges the hospitality of the Aspen Center for Physics, which is supported in part by NSF Grant No. PHY-1066293.

Appendix

Constrained inversions

In this section we generalize the results obtained in Sections 2 and 3 to systems in which inversions can happen between inverted SSR-sites Inline graphic (RL) but not between matching SSR-sites Inline graphic (LR) (14). We call this type of recombination, when only one type of inversions is possible, the case of constrained inversions. As before, we assume the DNA sequence starts with an R and ends with an L SSR-site. Here we will show that most of the results obtained in the present study can apply in the case of constrained inversions. However, some of sequences cannot be obtained due to the constraint, as detailed below.

Figure 15.

Figure 15

Constrained inversions. (A) An initial sequence. (B) Sequences that can be obtained from (A) using inversions between RL but not LR sites. (C) Sequences that cannot be obtained from (A) with the constrained inversions.

Before we present our results, we will illustrate the effects of the constraint on a simple example (Figure 15). This sequence has M=0 LL or RR sites, and N=3 LR or RL sites. Eq. (2.2) yields ZM,N=2NM![N+12]![N2]!d(M,N)=16 sequences. Here, as above, [x] means the largest integer smaller than or equal to x. However, as illustrated in Figure 15, only 8 of these sequences can be reached using the constrained inversions. We show here that this result is general, i.e. with the constrained inversion, the number of possible sequences in always equal to one half of that for the case of all inversions possible:

ZM,Nconstrained=2N-1M![N+12]![N2]!d(M,N). (A1)

Here d(M,N) is given in Eq. (2.1).

Below we will sketch the proof of Eq. (A1). Let us consider a DNA sequence that includes two LR units. It can be written as follows: >A<B>C<D>E<. Here <B> and <D> are LR, while other units can contain arbitrary combinations of units as well. <B> and <D> cannot be inverted individually without the affecting rest of the sequence. It is easy to check, by enumeration of all possible inversions, that impossible combinations satisfy a simple constraint. Let us introduce the number of reverse-compliments amongst LR units w.r.t. the initial orientation, t. Thus, for the sequence >A<D’>C<B>E< (<D’> means reverse-complement of <D>), t=1. Let us also introduce the number s, which is the number of exchanges in the <B> and <D> pair. For the sequence >A<D’>C<B>E<, s=1, while for >A<B’>C<D’>E<, s=0. It is possible to check that only the sequences for which s+t is even can be obtained from the initial sequence >A<B>C<D>E<. The sequences for which s+t is odd are not possible through the constrained inversions. This is only true for the fixed remainder of the sequence, i.e. >A<*>C<*>E<. Here ‘*’ denotes either B or D or their reverse-complement. We therefore call the number χ=s+t the index of sequence. One can obtain >A<*>C<*>E< from >A<B>C<D>E< if χ(>A<*>C<*>E<) is even.

Let us now consider the sequence with more than two LR elements >A<B>C<D>E<…>Z<. A set of sequences within LR units can be obtained by a permutation P of the original sequence. Permutations form a group of transformations called the symmetric group. Every permutation can be written as a product of several neighboring exchanges. The permutation is called even or odd if it can be written as a product of an even/odd number of neighboring exchanges. Even/odd permutations will be assigned index s equal to 0 or 1, respectively. Although there are several ways to implement P as a superposition of neighboring exchanges, they all have the same index s. The number of reverse complement LR elements can be defined as above, as well as the index χ=s+t. We showed above that a possible exchange does not change the evenness of index χ. Thus, impossible configurations are such that index χ is odd, because the original configuration has an even index. Because for unconstrained inversions both even and odd χ are possible for the same fixed residual sequence, the number of configurations is reduced by a factor of 2 in the case of constrained inversions. Therefore Eq. (A1) describes the number of configurations in the constrained case. This describes the modifications to Theorem 2.

It is possible to show that in the case of constrained inversions, properties 1–4 and Theorem 1 still hold. In the proof of Theorem 1, the first step remains the same while in the second step, we use the operation I=(c,e)(a,c) shown in Figure 16.

Figure 16.

Figure 16

Operation involving inversions only between RL SSR-sites that corresponds to Type-B inversion in the proof of Theorem 1

Theorem 3 holds as before, therefore we still have equal probability to observe all possible DNA sequences included in Eq. (A1).

References

  • 1.Lichtman JW, Livet J, Sanes JR. A technicolour approach to the connectome. Nature reviews Neuroscience. 2008 Jun;9:417. doi: 10.1038/nrn2391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Livet J, et al. Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature. 2007 Nov 1;450:56. doi: 10.1038/nature06293. [DOI] [PubMed] [Google Scholar]
  • 3.Hampel S, et al. Drosophila Brainbow: a recombinase-based fluorescence labeling technique to subdivide neural expression patterns. Nature methods. 2011 Mar;8:253. doi: 10.1038/nmeth.1566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Lu R, Neff NF, Quake SR, Weissman IL. Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nature biotechnology. 2011;29:928. doi: 10.1038/nbt.1977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Oyibo HK, et al., editors. Society for Neuroscience Annual Meeting. 2011. pp. 617.25/XX57. [Google Scholar]
  • 6.Cao HHOG, Zhan H, Znamenskiy P, Koulakov A, Enquist L, Dubnau J, Zador A. Society for Neuroscience Annual Meeting. 2011. pp. 840.11/ZZ63. [Google Scholar]
  • 7.Sauer B. Functional expression of the cre-lox site-specific recombination system in the yeast Saccharomyces cerevisiae. Molecular and cellular biology. 1987 Jun;7:2087. doi: 10.1128/mcb.7.6.2087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Sauer B, Henderson N. Site-specific DNA recombination in mammalian cells by the Cre recombinase of bacteriophage P1. Proceedings of the National Academy of Sciences of the United States of America. 1988 Jul;85:5166. doi: 10.1073/pnas.85.14.5166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Nagy A. Cre recombinase: the universal reagent for genome tailoring. Genesis. 2000 Feb;26:99. [PubMed] [Google Scholar]
  • 10.Van Duyne GD. A structural view of cre-loxp site-specific recombination. Annual review of biophysics and biomolecular structure. 2001;30:87. doi: 10.1146/annurev.biophys.30.1.87. [DOI] [PubMed] [Google Scholar]
  • 11.Komano T. Shufflons: multiple inversion systems and integrons. Annual review of genetics. 1999;33:171. doi: 10.1146/annurev.genet.33.1.171. [DOI] [PubMed] [Google Scholar]
  • 12.Norris JR. Markov chains Cambridge series in statistical and probabilistic mathematics. Cambridge University Press; Cambridge; New York: 1997. p. xvi.p. 237. [Google Scholar]
  • 13.Horn RA, Johnson CR. Topics in matrix analysis. 1st. Cambridge University Press; Cambridge; New York: 1994. p. viii.p. 607. pbk. [Google Scholar]
  • 14.T. Komano, personal communication.

RESOURCES