Biophysics. 2009 May 30;5:37–44. doi: 10.2142/biophysics.5.37

Profile conditional random fields for modeling protein families with structural information

Akira R. Kinjo
PMCID: PMC5036637  PMID: 27857577

Abstract

A statistical model of protein families, called profile conditional random fields (CRFs), is proposed. This model may be regarded as an integration of the profile hidden Markov model (HMM) and the Finkelstein-Reva (FR) theory of protein folding. While the model structure of the profile CRF is almost identical to that of the profile HMM, it can incorporate arbitrary correlations in the sequences to be aligned to the model. In addition, as in the FR theory, the profile CRF can incorporate long-range pair-wise interactions between model states via mean-field-like approximations. We give the detailed formulation of the model, self-consistent approximations for treating long-range interactions, and algorithms for computing partition functions and marginal probabilities. We also outline the methods for the global optimization of model parameters as well as a Bayesian framework for parameter learning and selection of optimal alignments.

Keywords: sequence analysis, structure prediction, fold recognition, dynamic programming, mean field approximation


Protein sequence alignment is one of the most fundamental techniques in biological research. Since the early methods were first proposed [1-3], techniques for protein sequence alignment have made great progress toward the detection of very weak homology [4,5]. Today, the most advanced methods incorporate information obtained from multiple sequence alignments in terms of sequence profiles [6] or position-specific scoring matrices (PSSM). In sequence profiles, such as those used in PSI-BLAST [7], scores for amino acid substitutions are made position-specific so that subtle evolutionary signals can be embedded at each site [8]. This in turn makes homology searches more sensitive. Profile hidden Markov models (HMMs) [4,9] further elaborate the sequence profile methods so that deletions and insertions are also made position-specific. Although powerful, these methods do have limitations. The profile methods (including profile HMMs) assume that each position in a profile is independent of the others, which makes it difficult to incorporate long-range correlations among different sites. The importance of long-range correlations is evident when one takes into account the tertiary structure of a protein, in which residues far apart along the sequence are in contact to define the specific native structure. In practice, one can supplement a plain sequence profile with some structural information, as in three-dimensional (3D) profiles [10] or threading [11], but such combined approaches remain inherently ad hoc. In the case of profile HMMs, it is extremely difficult, if not impossible, to employ such an approach since the inclusion of site-site correlations, both short-range and long-range, may break the probabilistic framework of the model.

In order to incorporate long-range correlations into an HMM-like model in a well-defined manner, we present in this paper the theoretical formulation of a model based on conditional random fields (CRFs) [12]. Various CRF-based models have been successfully applied to many problems in biological domains, including pairwise protein sequence alignment [13], gene prediction [14], and protein conformation sampling [15], to name a few. CRFs share many of the advantages of HMMs while being able to handle site-site correlations. In the context of profile CRFs, we need to distinguish two types of site-site correlations: correlations within the sequence to be aligned to a CRF model, and correlations among the sites within the model. The profile CRF model proposed in this paper has no limitation in incorporating both types of correlations, although some approximations are necessary for the latter type in practical applications. Without the model-site correlations, the profile CRF model may be regarded as a generalization of the profile HMM. Unlike profile HMMs, profile CRFs can incorporate many kinds of features of the sequence in terms of feature functions. With the model-site correlations, the profile CRF model may be regarded as a generalization of the self-consistent molecular field theory of Finkelstein and Reva [16-18], which, in turn, is a generalization of the Ising model in one dimension (1D).

In this paper, we first present the model structure of the profile CRF, provide some examples of possible feature functions, and derive some approximations for treating long-range correlations between model sites. Next, we present algorithms for computing partition functions, marginal probabilities, and optimal alignments, followed by methods for parameter learning based on multiple sequence alignments. Since our purpose here is to present the formulation and algorithms, actual implementation of the method and experimentation thereof are left for future studies. Nevertheless, we believe that the method presented here should serve as a firm basis for the analysis of protein sequences and structures.

Theory

Profile conditional random field model

We model a protein family (or a multiple sequence alignment) in a manner analogous to profile HMMs [4,9] (Fig. 1). A profile CRF model $\mathcal{M}$ is formally defined as a tuple of four components:

Figure 1. The model structure of a profile conditional random field (CRF). Squares, diamonds, and circles are matching, insertion, and deletion states, respectively. The start and end states are labeled with “S” and “E” in the squares.

$\mathcal{M} = (M, \mathcal{S}, \mathcal{F}, \theta)$ (1)

where M is the length of the model $\mathcal{M}$, and $\mathcal{S} = \{M_k, I_k, D_k\}$ is a set of states indexed by the model sites k = 0, 1, ..., M, M+1. For each site k (1 ≤ k ≤ M), there are a matching state $M_k$, an insertion state $I_k$, and a deletion state $D_k$. For k = 0, there are only a matching state $M_0$ and an insertion state $I_0$; for k = M+1, there is only a matching state $M_{M+1}$. The matching states at the termini, $M_0$ and $M_{M+1}$, are also called the start state and the end state, respectively, for reasons that will become apparent below. The model sites with k = 1, ..., M may be regarded as the core sites of the protein family. The third component, $\mathcal{F}$, is a set of feature functions associated with the model states ($\mathcal{S}$). Each feature function maps an amino acid sequence and its site indices to a real number, depending on the model sites. The last component, θ, is a set of parameters, or external fields, each of which is associated with a feature function in $\mathcal{F}$. Together with the feature functions, the external fields are used for evaluating alignments between the model and amino acid sequences. The details of these terms will be clarified below. In a profile CRF model, the feature functions must be given a priori, and the values of the external fields are learned from a multiple sequence alignment (MSA).

The objective is to align a protein sequence $x = x_1 x_2 \ldots x_L$ (called the target sequence) to the model. An alignment between a target sequence x and a CRF model is an ordered sequence of pairs of target sites and model states (called site-state pairs in the following):

$A = \{(0, M_0), (1, y_1), \ldots, (i, y_i), \ldots, (L+1, M_{M+1})\}$ (2)

where $y_i = S_k \in \{M_k, I_k, D_k\}_{k=0,\ldots,M+1}$. The pair $(i, y_i)$ reads as “the target site i is matched to the model state $y_i$.” It is assumed that if $i \le j$, $y_i = S_k$ and $y_j = S_l$, then $k \le l$. An alignment always starts at the start state and ends at the end state, so the pairs $(0, M_0)$ and $(L+1, M_{M+1})$ are fixed in any alignment. In an alignment, not all transitions from one site-state pair to another are possible. The allowed transitions are listed in Table 1 and depicted in Fig. 1 by arrows. By convention, a match to a deletion state $(i, D_k)$ means that the deletion resides between the i-th and (i+1)-th positions of the sequence. For example, an alignment of an 8-residue target sequence to an M = 7 profile CRF model might be given as

Table 1. Allowed transitions of site-state pairs. Here, i and k indicate a site of the target sequence and a site of the CRF model, respectively.

$(i, M_k) \to (i+1, M_{k+1})$
$(i, M_k) \to (i+1, I_k)$
$(i, M_k) \to (i, D_{k+1})$
$(i, I_k) \to (i+1, M_{k+1})$
$(i, I_k) \to (i+1, I_k)$
$(i, I_k) \to (i, D_{k+1})$
$(i, D_k) \to (i+1, M_{k+1})$
$(i, D_k) \to (i+1, I_k)$
$(i, D_k) \to (i, D_{k+1})$
$A = (x, y) = \{(0, M_0), (1, I_0), (2, M_1), (3, M_2), (3, D_3), (4, M_4), (5, M_5), (6, I_5), (7, I_5), (8, M_6), (8, D_7), (9, M_8)\}$. (3)

As can be inferred from this example, $(i, M_k)$ indicates that the i-th residue matches the k-th core site of the model, $(i, I_k)$ indicates that there is an insertion at the i-th site of the target sequence, and $(i, D_k)$ indicates that there is a deletion between the i-th and (i+1)-th sites of the target sequence. In terms of ordinary sequence alignment, the alignment in Eq. (3) may be expressed as

-  M1 M2 M3 M4 M5 -  -  M6 M7
x1 x2 x3 -  x4 x5 x6 x7 x8 -     (4)

where the ‘–’ signs in the upper (model) row indicate insertions (corresponding to $I_k$) and those in the lower (target) row indicate deletions ($D_k$) at the model sites.
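For concreteness, an alignment can be carried around as exactly this list of site-state pairs. The following sketch (the tuple encoding of states is our own, hypothetical choice, not part of the paper) stores the alignment of Eq. (3) and reproduces the two-row representation of Eq. (4):

```python
# The alignment of Eq. (3) as a plain data structure: a list of
# (target site, model state) pairs, states encoded as ('M'|'I'|'D', k).
# A minimal sketch of the bookkeeping only.
alignment = [(0, ('M', 0)), (1, ('I', 0)), (2, ('M', 1)), (3, ('M', 2)),
             (3, ('D', 3)), (4, ('M', 4)), (5, ('M', 5)), (6, ('I', 5)),
             (7, ('I', 5)), (8, ('M', 6)), (8, ('D', 7)), (9, ('M', 8))]

def to_rows(alignment):
    """Render the site-state pairs as the two-row alignment of Eq. (4)."""
    model_row, target_row = [], []
    for i, (kind, k) in alignment[1:-1]:      # skip the start and end states
        if kind == 'M':                       # residue i matches core site k
            model_row.append(f'M{k}'); target_row.append(f'x{i}')
        elif kind == 'I':                     # insertion: gap in the model row
            model_row.append('-'); target_row.append(f'x{i}')
        else:                                 # deletion: gap in the target row
            model_row.append(f'M{k}'); target_row.append('-')
    return ' '.join(model_row), ' '.join(target_row)

for row in to_rows(alignment):
    print(row)
```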

Alignments are evaluated in terms of a set of feature functions $\mathcal{F} = \{s_S^\alpha, t_{S,S'}^\beta, u_{S,S'}^\gamma\}$. Three types of feature functions are distinguished, namely, singlet feature functions $s_S^\alpha(x, i)$, doublet feature functions $t_{S,S'}^\beta(x, i)$, and pairwise feature functions $u_{S,S'}^\gamma(x, i, j)$. The singlet feature function (SFF) $s_S^\alpha(x, i)$ is a real-valued function representing some feature α of the target sequence when $y_i = S$; the doublet feature function (DFF) $t_{S,S'}^\beta(x, i)$ is also a real-valued function, representing some feature β when $y_{i^-} = S$ and $y_i = S'$. Here, $i^-$ is the predecessor of i, defined as

$i^- = \begin{cases} i & (\text{if } y_i = D_k), \\ i - 1 & (\text{if } y_i = M_k \text{ or } I_k). \end{cases}$ (5)

In general, α may depend on S, and β may depend on S and S′. The singlet and doublet feature functions are called local or short-ranged since the former represent interactions at one model site and the latter, interactions between two adjacent model sites for which transitions are allowed. The pairwise feature function (PFF) $u_{S,S'}^\gamma(x, i, j)$, representing some feature γ, is defined for $y_i = S$ and $y_j = S'$. While singlet and doublet feature functions are local, pairwise feature functions are non-local in the sense that S and S′ can be any pair of the model states, not necessarily those for which direct transitions are allowed.

Each singlet, doublet, or pairwise feature function is coupled with a parameter called an external field: $\lambda_S^\alpha$ for $s_S^\alpha$, $\mu_{S,S'}^\beta$ for $t_{S,S'}^\beta$, and $v_{S,S'}^\gamma$ for $u_{S,S'}^\gamma$. That is, $\theta = \{\lambda_S^\alpha, \mu_{S,S'}^\beta, v_{S,S'}^\gamma\}$. The product of a feature function and its coupled external field yields the score of the corresponding feature when a particular target site is aligned to a model state. For example, the product $\lambda_S^\alpha s_S^\alpha(x, i)$ is the score of the feature α when the target site i is aligned to the model state S. In the formulation of CRFs, it is convenient to employ an analogy to statistical physics. Thus, the negative total score of an alignment is interpreted as the total energy, and the normalization factor for the conditional probability of alignments as the partition function of the target sequence.

Given an alignment between the model and the sequence, the total energy of an alignment A = (x, y) = {(0, M0), ..., (i, yi), ..., (L+1, MM+1)} is defined by

$E(y, x, \theta) = -\sum_{\{i\}} \left[ \sum_\alpha \lambda_{y_i}^\alpha s_{y_i}^\alpha(x, i) + \sum_\beta \mu_{y_{i^-}, y_i}^\beta t_{y_{i^-}, y_i}^\beta(x, i) \right] - \sum_{\{i<j\}} \sum_\gamma v_{y_i, y_j}^\gamma u_{y_i, y_j}^\gamma(x, i, j)$ (6)

where the summation over {i} means summing along the alignment (x, y) (there can be multiple occurrences of the same index i due to the matching to deletion states); the double summation for i< j is also similarly defined. The partition function of this system is thus given by

$Z(x, \theta) = \sum_{\{y\}} \exp[-E(y, x, \theta)/T]$ (7)

where the summation is over all possible alignments, and T is the temperature (in energy unit). The conditional probability of obtaining a particular alignment A = (x, y) for a given x is

$P(y|x, \theta) = \dfrac{\exp[-E(y, x, \theta)/T]}{Z(x, \theta)}$ (8)

which is also called the likelihood of the alignment in the following. The log-likelihood is defined by

$L(\theta|x, y) = \log P(y|x, \theta) = -E(y, x, \theta)/T - \log Z(x, \theta)$. (9)

From here on, we assume T = 1 unless otherwise stated. The derivatives of the log-likelihood with respect to the parameters, ∂L/∂θ, are useful both for parameter learning and for deriving approximations. For singlet terms, they are given as

$\dfrac{\partial L(\theta|x, y)}{\partial \lambda_S^\alpha} = \sum_{\{i\}} s_S^\alpha(x, i)\, [\delta_{S, y_i} - P(S|x, i)]$ (10)

where $\delta_{S, y_i}$ is Kronecker’s delta and P(S|x, i) is the marginal probability that the i-th site of the target sequence is aligned to the state S of the model, i.e., $y_i = S$. Similarly, for the doublet terms,

$\dfrac{\partial L(\theta|x, y)}{\partial \mu_{S, S'}^\beta} = \sum_{\{i\}} t_{S, S'}^\beta(x, i)\, [\delta_{S, y_{i^-}} \delta_{S', y_i} - P(S, S'|x, i)]$ (11)

where P(S, S′|x, i) is the marginal probability that $y_{i^-} = S$ and $y_i = S'$. Finally, for the pairwise terms,

$\dfrac{\partial L(\theta|x, y)}{\partial v_{S, S'}^\gamma} = \sum_{\{i<j\}} u_{S, S'}^\gamma(x, i, j)\, [\delta_{S, y_i} \delta_{S', y_j} - P(S, S'|x, i, j)]$ (12)

where P(S, S′|x, i, j) is the marginal probability that yi = S and yj = S′. Either when parameters are optimal for a given alignment or when the alignment is optimal for given parameters, we have ∂L/∂θ = 0.

Feature functions

Although our focus here is the formulation of the profile CRF model, it is instructive to provide some concrete examples of feature functions. It should be stressed, however, that the actual selection of feature functions will require careful experimentation to maximize the effectiveness of the profile CRF framework.

Singlet feature functions

Singlet feature functions represent compatibility measures between a model state and the target sequence. They may depend on the whole target sequence as well as on single amino acid residues. One simple SFF may be such that

$s_{M_k}^R(x, i) = \delta_{x_i, R}$ (13)

where R is one of the 20 standard amino acid residue types. It is implicitly assumed that this feature function is defined only when $y_i = M_k$. The same assumption applies throughout the following discussion.

If the target sequence is accompanied by its PSSM, the above SFF (Eq. 13) can be generalized as

$s_{M_k}^{\mathrm{PSSM}(R)}(x, i) = \mathrm{PSSM}(i, R)$ (14)

where PSSM(i,R) is the value of the PSSM for residue type R at site i.

SFFs can also depend on multiple sites of the target sequence. For example, let us partition the amino acid residues into hydrophobic (1) and hydrophilic (0) classes, and let $b_7(x, i)$ be a binary word encoding [19] function of the 7-residue sub-sequence $x_{i-3} \ldots x_{i+3}$. Then, the SFF

$s_{I_k}^{0000000}(x, i) = \delta_{0000000,\, b_7(x, i)}$ (15)

may enhance insertions at highly hydrophilic regions of the target sequence. Similarly, the SFF

$s_{M_k}^{0011011}(x, i) = \delta_{0011011,\, b_7(x, i)}$ (16)

may enhance the matching at α-helical regions since the binary pattern 0011011 is typical of α-helices. There are $2^7 = 128$ types of binary words for 7-residue segments, and we can incorporate all of them in a single profile CRF model.
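As an illustration, a binary word SFF of this kind might be implemented as follows; the hydrophobic residue set and the handling of the sequence termini are assumptions of this sketch:

```python
# Sketch of the 7-residue binary word encoding b7 of Eqs. (15)-(16):
# hydrophobic residues map to 1 and hydrophilic ones to 0.
HYDROPHOBIC = set('AVLIMFWC')   # an assumed partition, not the paper's

def b7(x, i):
    """Binary word for the window x_{i-3} .. x_{i+3} around 1-based site i;
    returns None where the full 7-residue window is unavailable."""
    if i - 4 < 0 or i + 3 > len(x):
        return None
    return ''.join('1' if r in HYDROPHOBIC else '0' for r in x[i - 4:i + 3])

def sff_binary(word, x, i):
    """SFF value: 1 if the window around site i spells `word`, else 0."""
    return 1.0 if b7(x, i) == word else 0.0

print(sff_binary('0011011', 'GDSSLVKLVDS', 6))   # 1.0 for this toy sequence
```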

If either predicted or observed structural information is available for the target sequence, we may define, for example,

$s_{M_k}^{\mathrm{H}}(x, i) = \delta_{\mathrm{H}, \mathrm{SS}(i)}$ (17)

where SS(i) indicates the secondary structure of site i.

Doublet feature functions

Doublet feature functions represent the feasibility of transitions from one site-state pair to another. A trivial example is one that does not depend on the target sequence at all. For example, the DFF

$t_{M_k, I_k}^{-}(x, i) = 1$ (18)

may be regarded as a feature representing a gap (insertion) opening. Similar sequence-independent DFFs can be defined for all the allowed state transitions.

Of course, DFFs can also be made target sequence-dependent. Taking the binary word encoding example again, the following DFF

$t_{M_k, D_{k+1}}^{001101}(x, i) = \delta_{001101,\, b_6(x, i)}$ (19)

may help to suppress deletions at α-helical regions, since the pattern 001101 is typical of α-helices, in which deletions are less likely to occur.

Pairwise feature functions

With pairwise feature functions, it is possible to incorporate correlations between two states that are not directly connected by transitions. Such correlations are most easily grasped in the context of the tertiary structure of a protein. Suppose that there is a known structure in a protein family to be modeled as a profile CRF, and that this structure contains a pair of contacting residues corresponding to the matching states $M_k$ and $M_l$. We may define

$u_{M_k, M_l}^{\mathrm{contact}(R, R')}(x, i, j) = \delta_{x_i, R}\, \delta_{x_j, R'}$ (20)

where R and R′ are amino acid residue types. We can define different PFFs for different kinds of interactions such as hydrogen bonds, salt bridges, and hydrophobic contacts. The sequence-dependence may also be made more complex; for example, we can combine contacts with binary word encoding.
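A contact PFF of the form of Eq. (20) is a simple indicator function; the following minimal sketch uses a closure-based parameterization, which is an illustrative choice rather than the paper's:

```python
def make_contact_pff(R, R_prime):
    """PFF of Eq. (20) for a contacting pair of matching states: fires
    when the target residues at 1-based sites i and j have types R, R'."""
    def u_contact(x, i, j):
        return 1.0 if (x[i - 1], x[j - 1]) == (R, R_prime) else 0.0
    return u_contact

u_LV = make_contact_pff('L', 'V')
print(u_LV('GLSSV', 2, 5))   # 1.0: L at site 2 paired with V at site 5
```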

Approximations for pairwise interactions

If there are no pairwise terms, exact partition functions and exact optimal alignments for profile CRF models can be computed efficiently by dynamic programming just as in profile HMMs. With pairwise terms present, however, the computation of exact solutions is intractable. In order to make computations feasible, we need to make some approximations. More specifically, we will derive a Bethe approximation, which is further simplified to a mean-field approximation.

Observe, first, that the pairwise terms can be rearranged as

$\sum_{\{i<j\}} \sum_\gamma v_{y_i, y_j}^\gamma u_{y_i, y_j}^\gamma(x, i, j) = \sum_{\{i<j\}} \sum_\gamma \sum_{S, S'} v_{S, S'}^\gamma u_{S, S'}^\gamma(x, i, j)\, \delta_{S, y_i} \delta_{S', y_j}$. (21)

When the alignment is optimal, we have $\partial L(\theta|x, y)/\partial v_{S,S'}^\gamma = 0$ (Eq. 12), hence the following:

$\sum_{\{i<j\}} u_{S, S'}^\gamma(x, i, j)\, \delta_{S, y_i} \delta_{S', y_j} = \sum_{\{i<j\}} u_{S, S'}^\gamma(x, i, j)\, P(S, S'|x, i, j)$. (22)

Using this relation, the pairwise terms are rearranged as

$\sum_{\{i<j\}} \sum_{\gamma, S, S'} v_{S, S'}^\gamma u_{S, S'}^\gamma(x, i, j)\, \delta_{S, y_i} \delta_{S', y_j} = \sum_{\{i\}} \sum_{\gamma, S} \tilde{u}_S^\gamma(x, i)\, P(S|x, i)$, (23)

where $\tilde{u}_S^\gamma(x, i)$ is the renormalized singlet feature function defined by

$\tilde{u}_S^\gamma(x, i) = \dfrac{1}{2} \sum_{\{j\}} \sum_{S'} v_{S, S'}^\gamma u_{S, S'}^\gamma(x, i, j)\, P(S'|S, x, i, j)$. (24)

The conditional marginal probability P(S′|S,x, i, j) is defined by

$P(S'|S, x, i, j) = \dfrac{P(S, S'|x, i, j)}{P(S|x, i)}$. (25)

Using $\tilde{u}_S^\gamma(x, i)$ and introducing a coupled external field $\tilde{v}_S^\gamma$, let us define a tentative total energy:

$\tilde{E}(y, x, \theta) = -\sum_{\{i\}} \left[ \sum_\alpha \lambda_{y_i}^\alpha s_{y_i}^\alpha(x, i) + \sum_\gamma \tilde{v}_{y_i}^\gamma \tilde{u}_{y_i}^\gamma(x, i) + \sum_\beta \mu_{y_{i^-}, y_i}^\beta t_{y_{i^-}, y_i}^\beta(x, i) \right]$. (26)

By calculating the log-likelihood (Eq. 9) based on this energy and its derivative with respect to $\tilde{v}_S^\gamma$ (Eq. 10), and enforcing the optimality condition $\partial L(\theta|x, y)/\partial \tilde{v}_S^\gamma = 0$, we have

$\sum_{\{i\}} \tilde{u}_S^\gamma(x, i)\, \delta_{S, y_i} = \sum_{\{i\}} \tilde{u}_S^\gamma(x, i)\, P(S|x, i)$. (27)

Substituting this relation into Eq. (23), we have

$\sum_{\{i<j\}} \sum_\gamma v_{y_i, y_j}^\gamma u_{y_i, y_j}^\gamma(x, i, j) = \sum_{\{i\}} \sum_\gamma \tilde{u}_{y_i}^\gamma(x, i)$. (28)

Therefore, the pairwise energy terms can be converted into renormalized singlet energy terms as long as the alignment is optimal. For non-optimal alignments, we approximate the total energy by

$E(y, x, \theta) \approx \tilde{E}(y, x, \theta)$ (29)

with $\tilde{v}_S^\gamma = 1$. The renormalized singlet feature function (Eq. 24) explicitly accounts for the pairwise joint probability, and hence this may be called a Bethe or quasi-chemical approximation. Furthermore, if we assume that two alignment sites are independent, we can decouple the joint marginal probability as

$P(S, S'|x, i, j) \approx P(S|x, i)\, P(S'|x, j)$. (30)

This is a mean-field approximation. Substituting it into Eqs. (24) and (25), we have the following mean-field energy:

$\tilde{u}_S^\gamma(x, i) \approx \dfrac{1}{2} \sum_{\{j\}} \sum_{S'} v_{S, S'}^\gamma u_{S, S'}^\gamma(x, i, j)\, P(S'|x, j)$. (31)

An advantage of this approximation is that we need not compute the joint marginal probabilities. By using either the Bethe (Eq. 24) or the mean-field (Eq. 31) approximation, the energy of an alignment is expressed as

$E(y, x, \theta) \approx -\sum_{\{i\}} \left[ \sum_\alpha \lambda_{y_i}^\alpha s_{y_i}^\alpha(x, i) + \sum_\gamma \tilde{u}_{y_i}^\gamma(x, i) + \sum_\beta \mu_{y_{i^-}, y_i}^\beta t_{y_{i^-}, y_i}^\beta(x, i) \right]$. (32)

Note that there are apparently no external field parameters for the renormalized SFFs $\tilde{u}_S^\gamma(\cdot)$; they are included in the definitions (Eqs. 24, 31). Since the mean-field feature functions are effectively singlet feature functions, we can apply the standard procedure for learning and alignment, provided that the mean fields are known. Of course, the mean fields are not known in advance, so we need to obtain the partition function by an iterative procedure. That is,

  1. Arbitrarily set $\tilde{u}_S^\gamma(\cdot)$.

  2. Calculate the partition function and marginal probabilities based on the previously calculated $\tilde{u}_S^\gamma(\cdot)$.

  3. Based on the partition function and marginal probabilities of the previous step, update $\tilde{u}_S^\gamma(\cdot)$ by Eq. (24) or Eq. (31).

  4. Iterate steps 2 and 3 until convergence.
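The following toy sketch shows the structure of this self-consistent loop for the mean-field update of Eq. (31). It is illustrative only: the pairwise couplings are folded into a random tensor U, and the exact forward-backward marginals of the next section are replaced by a site-wise softmax stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n = 6, 3                                     # toy: 6 sites, 3 states per site
U = rng.normal(scale=0.1, size=(L, n, L, n))    # couplings v*u folded together
U = (U + U.transpose(2, 3, 0, 1)) / 2           # symmetrize the pair couplings
base = rng.normal(size=(L, n))                  # stand-in singlet/doublet scores

def marginals(scores):
    """Stand-in for forward-backward: site-wise softmax of the scores."""
    e = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=1, keepdims=True)

u_tilde = np.zeros((L, n))                      # step 1: arbitrary initialization
for it in range(200):
    P = marginals(base + u_tilde)               # step 2: marginals P(S|x, i)
    new = 0.5 * np.einsum('isjt,jt->is', U, P)  # step 3: Eq. (31) update
    if np.abs(new - u_tilde).max() < 1e-9:      # step 4: iterate to convergence
        u_tilde = new
        break
    u_tilde = new
print('mean fields after', it + 1, 'iterations:', u_tilde.round(3))
```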

The algorithms for computing the partition function and marginal probabilities are the subject of the next section.

Algorithms for alignment and learning

Computation of partition function, marginal probabilities and optimal alignment

The partition function (Eq. 7) and marginal probabilities can be calculated efficiently by dynamic programming (or the transfer matrix method). In this section, we assume that the pairwise terms are approximated as renormalized SFFs (Eqs. 24, 31) and are treated as ordinary SFFs. First, we define the transfer matrix:

$T_i(S', S) = \exp[e_i(S', S)/T]$ (33)

where $S', S \in \mathcal{S}$, T is the temperature, and

$e_i(S', S) = \sum_\alpha \lambda_S^\alpha s_S^\alpha(x, i) + \sum_\beta \mu_{S', S}^\beta t_{S', S}^\beta(x, i)$. (34)

The partition function (Eq. 7) is then expressed as

$Z(x) = \sum_{\{y\}} \prod_{\{i\}} T_i(y_{i^-}, y_i)$ (35)

where the summation is over all possible model states of each residue of the target sequence. In order to compute the partition function Eq. (35), we define an auxiliary function $Z_{i,j}(S_k, S_l)$, where i, j = 0, ..., L+1 and $S_k \in \{M_k, I_k, D_k\}$, $S_l \in \{M_l, I_l, D_l\}$. $Z_{i,j}(S_k, S_l)$ is the partition function of the subsequence $x_i x_{i+1} \ldots x_j$ whose termini i and j are fixed to the model states $S_k$ and $S_l$, respectively. These conditions are expressed as

$Z_{i,i}(S_k, S) = \delta_{S_k, S}$, (36)
$Z_{j,j}(S, S_l) = \delta_{S, S_l}$. (37)

By the construction of the model, the following boundary conditions hold in particular:

$Z_{0,0}(M_0, M_0) = 1$, (38)
$Z_{L+1,L+1}(M_{M+1}, M_{M+1}) = 1$. (39)

The partition function Z(x) is given as

$Z(x) = Z_{0,L+1}(M_0, M_{M+1})$. (40)

Based on the boundary condition Eq. (36), the following forward recurrence equations for $Z_{i,j}(S_k, S_l)$ hold for j = i, ..., L+1 and l = k, ..., M+1:

$Z_{i,j}(S_k, M_l) = Z_{i,j-1}(S_k, M_{l-1})\, T_j(M_{l-1}, M_l) + Z_{i,j-1}(S_k, I_{l-1})\, T_j(I_{l-1}, M_l) + Z_{i,j-1}(S_k, D_{l-1})\, T_j(D_{l-1}, M_l)$; (41)
$Z_{i,j}(S_k, I_l) = Z_{i,j-1}(S_k, M_l)\, T_j(M_l, I_l) + Z_{i,j-1}(S_k, I_l)\, T_j(I_l, I_l) + Z_{i,j-1}(S_k, D_l)\, T_j(D_l, I_l)$; (42)
$Z_{i,j}(S_k, D_l) = Z_{i,j}(S_k, M_{l-1})\, T_j(M_{l-1}, D_l) + Z_{i,j}(S_k, I_{l-1})\, T_j(I_{l-1}, D_l) + Z_{i,j}(S_k, D_{l-1})\, T_j(D_{l-1}, D_l)$. (43)

It is understood that terms involving non-existent states and/or incompatible state transitions (e.g., $Z_{1,1}(M_0, D_0)$, $Z_{1,0}(I_0, I_0)$, etc.) are ignored. Similarly, together with the boundary condition Eq. (37), the backward recurrence equations are given for i = j, ..., 0 and k = l, ..., 0 as

$Z_{i,j}(M_k, S_l) = T_{i+1}(M_k, M_{k+1})\, Z_{i+1,j}(M_{k+1}, S_l) + T_{i+1}(M_k, I_k)\, Z_{i+1,j}(I_k, S_l) + T_i(M_k, D_{k+1})\, Z_{i,j}(D_{k+1}, S_l)$; (44)
$Z_{i,j}(I_k, S_l) = T_{i+1}(I_k, M_{k+1})\, Z_{i+1,j}(M_{k+1}, S_l) + T_{i+1}(I_k, I_k)\, Z_{i+1,j}(I_k, S_l) + T_i(I_k, D_{k+1})\, Z_{i,j}(D_{k+1}, S_l)$; (45)
$Z_{i,j}(D_k, S_l) = T_{i+1}(D_k, M_{k+1})\, Z_{i+1,j}(M_{k+1}, S_l) + T_{i+1}(D_k, I_k)\, Z_{i+1,j}(I_k, S_l) + T_i(D_k, D_{k+1})\, Z_{i,j}(D_{k+1}, S_l)$. (46)

For convenience, let us define the forward auxiliary function Fi(Sk) and the backward auxiliary function Bi (Sk) by

$F_i(S_k) = Z_{0,i}(M_0, S_k)$, (47)
$B_i(S_k) = Z_{i,L+1}(S_k, M_{M+1})$. (48)

Using $F_i$, $B_i$, and $Z_{i,j}$, we can calculate the marginal probabilities. The joint marginal probability is obtained as

$P(S_k, S_l|x, i, j) = \dfrac{F_i(S_k)\, Z_{i,j}(S_k, S_l)\, B_j(S_l)}{Z(x)}$. (49)

In particular, when i = j and $S_k = S_l$, we have

$P(S_k|x, i) = \dfrac{F_i(S_k)\, B_i(S_k)}{Z(x)}$. (50)

Similarly, for states S and S′ with an allowed transition (Table 1),

$P(S, S'|x, i) = \dfrac{F_{i^-}(S)\, T_i(S, S')\, B_i(S')}{Z(x)}$. (51)

Using these marginal probabilities, the renormalized SFFs for pairwise terms (Eqs. 24, 31) can be computed.
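To make the recursions concrete, the following sketch implements the forward recurrences (specialized to $F_i(S_k) = Z_{0,i}(M_0, S_k)$, Eqs. 41-43) for a model without pairwise terms. The state encoding and the signature of the local score function e are assumptions of this sketch, not the paper's notation:

```python
import math
from collections import defaultdict

def forward_partition(L, M, e):
    """Forward recursion for F_i(S_k) = Z_{0,i}(M_0, S_k) (Eqs. 41-43).
    States are encoded as ('M'|'I'|'D', k), and e(i, s_prev, s) returns
    the local score e_i(S', S) of Eq. (34).  Returns Z(x) =
    F_{L+1}(M_{M+1}); non-existent states contribute zero weight."""
    T = lambda i, sp, s: math.exp(e(i, sp, s))    # transfer matrix, Eq. (33), T = 1
    F = defaultdict(float)
    F[(0, ('M', 0))] = 1.0                        # boundary condition, Eq. (38)
    for i in range(L + 2):
        for k in range(M + 2):
            if i > 0 and k >= 1:                  # matching states M_1 .. M_{M+1}
                for sp in (('M', k - 1), ('I', k - 1), ('D', k - 1)):
                    F[(i, ('M', k))] += F[(i - 1, sp)] * T(i, sp, ('M', k))
            if i > 0 and k <= M:                  # insertion states I_0 .. I_M
                for sp in (('M', k), ('I', k), ('D', k)):
                    F[(i, ('I', k))] += F[(i - 1, sp)] * T(i, sp, ('I', k))
            if 1 <= k <= M:                       # deletion states D_1 .. D_M
                for sp in (('M', k - 1), ('I', k - 1), ('D', k - 1)):
                    F[(i, ('D', k))] += F[(i, sp)] * T(i, sp, ('D', k))
    return F[(L + 1, ('M', M + 1))]

# With all scores zero, every allowed alignment has weight 1, so Z counts them:
print(forward_partition(L=1, M=1, e=lambda i, sp, s: 0.0))   # 3.0 alignments
```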

The optimal alignment for a given model and a target sequence is the one that yields the minimum energy, which corresponds to the free energy of the system at zero temperature (T = 0). The recurrence equations for the optimal alignment can be derived as the zero-temperature limit of the forward recurrence equations using the following formula [20]:

$\lim_{\epsilon \to +0} \epsilon \log\left[\sum_i e^{a_i/\epsilon}\right] = \max_i a_i$. (52)

That is, if we define a function

$A_i(S_k) = \lim_{T \to 0} [T \log F_i(S_k)]$, (53)

the energy of the optimal alignment $A = (x, y^{\mathrm{opt}})$ is given by

$E(y^{\mathrm{opt}}, x) = -A_{L+1}(M_{M+1})$. (54)

More concretely, we first set the boundary condition

$A_0(M_0) = 0$, (55)

and, applying the zero-temperature limit to both sides of the forward recurrence equations for $F_i(S_k) = Z_{0,i}(M_0, S_k)$ (Eqs. 41-43), we have

$A_i(M_k) = \max\{[A_{i-1}(M_{k-1}) + e_i(M_{k-1}, M_k)],\, [A_{i-1}(I_{k-1}) + e_i(I_{k-1}, M_k)],\, [A_{i-1}(D_{k-1}) + e_i(D_{k-1}, M_k)]\}$; (56)
$A_i(I_k) = \max\{[A_{i-1}(M_k) + e_i(M_k, I_k)],\, [A_{i-1}(I_k) + e_i(I_k, I_k)],\, [A_{i-1}(D_k) + e_i(D_k, I_k)]\}$; (57)
$A_i(D_k) = \max\{[A_i(M_{k-1}) + e_i(M_{k-1}, D_k)],\, [A_i(I_{k-1}) + e_i(I_{k-1}, D_k)],\, [A_i(D_{k-1}) + e_i(D_{k-1}, D_k)]\}$. (58)

By tracing back the site-state pairs that yield the optimal values of $A_i(S_k)$ at each step, we can find the optimal alignment $y^{\mathrm{opt}}$.
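The zero-temperature recursion can be obtained from the forward sketch above by replacing (sum, product) with (max, plus) on the scores $A_i(S_k)$ and keeping traceback pointers. A minimal sketch under the same assumed state encoding:

```python
import math
from collections import defaultdict

def optimal_alignment(L, M, e):
    """Zero-temperature recursion (Eqs. 55-58).  Returns the optimal
    energy E(y_opt, x) = -A_{L+1}(M_{M+1}) (Eq. 54) and the alignment
    recovered from traceback pointers.  A sketch, not an optimized solver."""
    NEG = float('-inf')
    A = defaultdict(lambda: NEG)
    A[(0, ('M', 0))] = 0.0                        # boundary condition, Eq. (55)
    back = {}
    for i in range(L + 2):
        for k in range(M + 2):
            targets = []
            if i > 0 and k >= 1:                  # Eq. (56)
                targets.append((('M', k), i - 1, (('M', k-1), ('I', k-1), ('D', k-1))))
            if i > 0 and k <= M:                  # Eq. (57)
                targets.append((('I', k), i - 1, (('M', k), ('I', k), ('D', k))))
            if 1 <= k <= M:                       # Eq. (58)
                targets.append((('D', k), i, (('M', k-1), ('I', k-1), ('D', k-1))))
            for s, ip, preds in targets:          # max-plus version of Eqs. 41-43
                for sp in preds:
                    cand = A[(ip, sp)] + e(i, sp, s)
                    if cand > A[(i, s)]:
                        A[(i, s)], back[(i, s)] = cand, (ip, sp)
    path, node = [], (L + 1, ('M', M + 1))        # trace back the optimal path
    while node in back:
        path.append(node)
        node = back[node]
    return -A[(L + 1, ('M', M + 1))], [node] + path[::-1]

E_opt, y_opt = optimal_alignment(L=1, M=1, e=lambda i, sp, s: 0.0)
```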

Parameter learning with multiple sequence alignment

Global optimization of parameters

The parameters of a profile CRF are the set of external fields $\lambda_S^\alpha$, $\mu_{S,S'}^\beta$ and $v_{S,S'}^\gamma$ (of course, we also need to specify the feature functions to start with). The input for parameter learning is a multiple sequence alignment (MSA) of a protein family, from which the model architecture must be somehow specified “by hand.” In this process, we need to specify which columns of the MSA correspond to matching states. After the columns of matching states are determined, matching, insertion and deletion states can be assigned to each column of each sequence in the MSA.

After the model architecture has been determined, learning can be done by maximizing the likelihood over the sequences of the input MSA. Let these alignments be $(x^{(p)}, y^{(p)})$, where p = 1, ..., n is the index of the sequences. The joint log-likelihood is given by

$L(\theta|\{x^{(p)}, y^{(p)}\}) = -\sum_{p=1}^{n} \left[ E(y^{(p)}, x^{(p)}, \theta) + \log Z(x^{(p)}, \theta) \right]$. (59)

Since the total energy is a linear function of the parameters and log Z is always a convex function of them, the total log-likelihood is a concave function of the parameters. This implies that we can obtain the globally optimal parameter set by gradient-based methods. In practice, maximizing the bare log-likelihood may result in over-fitting of the parameters to the training set. Therefore, we define an alternative objective function $K(\theta|\{x^{(p)}, y^{(p)}\})$ which includes a prior probability density of the parameters for regularization:

$K(\theta|\{x^{(p)}, y^{(p)}\}) = L(\theta|\{x^{(p)}, y^{(p)}\}) + \log P(\theta)$ (60)

where P(θ) is a Gaussian prior:

$P(\theta) = \prod_{\alpha, S} \exp\left[-\dfrac{(\lambda_S^\alpha)^2}{2(\sigma_S^\alpha)^2}\right] \prod_{\beta, S, S'} \exp\left[-\dfrac{(\mu_{S,S'}^\beta)^2}{2(\sigma_{S,S'}^\beta)^2}\right] \times \prod_{\gamma, S, S'} \exp\left[-\dfrac{(v_{S,S'}^\gamma)^2}{2(\sigma_{S,S'}^\gamma)^2}\right]$ (61)

Here, the hyper-parameters $\sigma_S^\alpha$, $\sigma_{S,S'}^\beta$ and $\sigma_{S,S'}^\gamma$ are the (expected) standard deviations of the corresponding external fields and must be specified a priori (if we use a hierarchical Bayes model, however, these hyper-parameters can be automatically adjusted based on the training data). Since we can calculate the gradient of this objective function, it is possible to use gradient-based optimization techniques. Since $K(\theta|\{x^{(p)}, y^{(p)}\})$ (Eq. 60) is still concave, the globally optimal parameters are guaranteed to be found by gradient ascent.
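A minimal sketch of this regularized gradient ascent follows; the gradient of the log-likelihood (Eqs. 10-12) is abstracted behind a user-supplied function, and the stand-in gradient in the usage lines is for illustration only:

```python
import numpy as np

def learn_fields(grad_loglik, theta0, sigma=1.0, lr=0.05, n_steps=1000):
    """Gradient ascent on K(theta) = L(theta) + log P(theta), where the
    Gaussian prior of Eq. (61) contributes grad log P(theta) = -theta/sigma**2.
    grad_loglik(theta) must return dL/dtheta as in Eqs. (10)-(12).
    A sketch; in practice a quasi-Newton method such as L-BFGS would be used."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta += lr * (grad_loglik(theta) - theta / sigma**2)
    return theta

# Illustration only: a stand-in gradient (target - theta) mimicking
# "empirical minus expected feature counts"; not a real CRF gradient.
target = np.array([0.8, -0.3, 1.2])
theta_hat = learn_fields(lambda th: target - th, np.zeros(3), sigma=2.0)
print(theta_hat)   # shrunk toward zero relative to `target` by the prior
```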

Bayesian learning

It is also possible to apply the Bayesian learning framework [21]. That is, instead of using a single, globally optimal parameter set, we can use a set of suboptimal parameters to make robust predictions. From Bayes’ formula, we have

$P(\theta|\{x^{(p)}, y^{(p)}\}) \propto \exp[L(\theta|\{x^{(p)}, y^{(p)}\})]\, P(\theta)$. (62)

Using this equation, a Bayesian alignment for the target sequence x may be selected so as to maximize the following probability:

$P(y|x, \{x^{(p)}, y^{(p)}\}) = \int P(y|x, \theta)\, P(\theta|\{x^{(p)}, y^{(p)}\})\, d\theta$. (63)

Suboptimal parameters may be obtained by Markov chain Monte Carlo simulations in the θ-space, using $-K(\theta|\{x^{(p)}, y^{(p)}\})$ as the “energy” of the system. Since the gradients of the log-likelihood can be computed, a hybrid Monte Carlo method is also at our disposal for efficient sampling.
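A random-walk Metropolis sampler in θ-space might look as follows; this is a sketch only (the hybrid Monte Carlo method mentioned above would additionally exploit the gradient for efficient proposals):

```python
import numpy as np

def sample_posterior(log_post, theta0, n_samples=1000, step=0.1, seed=0):
    """Random-walk Metropolis with "energy" -K(theta) = -log_post(theta).
    A minimal sketch of posterior sampling over the external fields."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)

# Toy usage: sample a 3-parameter standard normal stand-in "posterior".
draws = sample_posterior(lambda th: -0.5 * np.sum(th ** 2), np.zeros(3))
```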

We can also employ hierarchical Bayes learning, which can automatically adjust the hyper-parameters of the prior, $\sigma_S^\alpha$, $\sigma_{S,S'}^\beta$ and $\sigma_{S,S'}^\gamma$, based on the training set [21].

Discussion

In this paper, we have formulated the profile CRF to model protein families with possible long-range correlations such as structural information. The profile CRF model is clearly an extension of both the molecular field theory of Finkelstein and Reva (FR theory) [16-18] and the profile HMM [4,9], and hence an integration of the two. Here, we shall discuss the relationship of the present model with these two earlier models.

The FR theory is particularly focused on the 3D structures of proteins. Accordingly, its model is explicitly represented in 3D space as a set of lattice points. The lattice points mostly correspond to residues in secondary structure elements (SSEs), and these points may be regarded as “match” states in the present framework. The FR model does not allow gaps within an SSE; insertions are allowed only in the regions between two SSEs. The energy functions (≈ feature functions) are physics-based, and the parameters are not optimized to fit training data but are obtained from physical experiments. Therefore, the FR models are more suitable for studying physical aspects of protein folding and structure prediction, but less so for more general-purpose sequence analysis. Nevertheless, almost all the theoretical foundations of the FR theory, such as the calculation of partition functions, marginal probabilities, and mean-field approximations (with the exception of parameter learning), are shared by profile CRFs. After all, both models are extensions of the 1D Ising model.

The analogy between the 1D Ising model and the more general sequence alignment problem was pointed out by Miyazawa [22] and was further extended to the problem of sequence-structure alignment with a mean-field approximation [23]. Later, Koike et al. [24] applied this analogy to compute partition functions and marginal probabilities in protein structure comparison with the Bethe approximation. By complementing the FR theory with these techniques, the alignment algorithm can be made more general, and one such generalization is the profile CRF model. The improvements that profile CRFs make on the FR theory are thus clear: a more general treatment of model states, possible insertions and deletions at any site, and parameter learning based on MSAs.

Profile HMMs, being a class of generative models, need to calculate the joint probability P(x, y) of an alignment, while profile CRFs, being a class of discriminative models, directly calculate the conditional probability P(y|x). In special cases, with the definition of the conditional probability P(y|x) = P(x, y)/P(x) in mind, we may regard Z(x) as P(x) and exp[−E(y, x)] as P(x, y). More specifically, if we define only the following feature functions (and no others) with appropriate values for the external fields, we can construct a CRF that is equivalent to a given HMM:

  1. Define singlet feature functions $s_{S_k}^R$ for the matching and insertion states as in Eq. (13). For the deletion states, just define a constant SFF (always equal to 1).

  2. Define sequence-independent doublet feature functions $t_{S_k,S_l}^{-}$ for each allowed transition as in Eq. (18).

  3. Set the singlet external fields as $\lambda_{S_k}^R = \log q_{S_k}(R)$, where $q_{S_k}(R)$ is the emission probability of the HMM.

  4. Set the doublet external fields as $\mu_{S_k,S_l}^{-} = \log p_{S_k,S_l}$, where $p_{S_k,S_l}$ is the transition probability of the HMM.

However, this equivalence breaks down as soon as we incorporate other feature functions into profile CRFs since the Boltzmann factor exp[−E(y,x)] may no longer satisfy a condition of probability measure (i.e., normalization to 1). Thus, HMMs are a very special class of CRFs.
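As a sketch, the field assignments in steps 3 and 4 above amount to nothing more than taking logarithms of the HMM probability tables; the dict-based containers here are an assumption of this sketch, not the paper's notation:

```python
import math

def hmm_to_crf_fields(emission, transition):
    """Convert HMM parameters to the equivalent CRF external fields:
    lambda_{S_k}^R = log q_{S_k}(R) and mu_{S_k,S_l} = log p_{S_k,S_l}."""
    lam = {(s, r): math.log(q) for s, qs in emission.items() for r, q in qs.items()}
    mu = {pair: math.log(p) for pair, p in transition.items()}
    return lam, mu

# Toy usage with a two-state fragment of an HMM's probability tables.
lam, mu = hmm_to_crf_fields(
    emission={('M', 1): {'A': 0.6, 'G': 0.4}},
    transition={(('M', 1), ('I', 1)): 0.1, (('M', 1), ('M', 2)): 0.9})
```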

In summary, we have presented the profile CRF model. This model is flexible enough to accommodate almost any feature of a target sequence, including PSSMs, local sequence patterns, and even long-range correlations. It can also incorporate various features of a modeled protein family, such as local structures and long-range pairwise interactions. Although concrete implementations are yet to be done, we expect this model to be a useful alternative to conventional methods for analyzing and understanding protein sequences and structures.

Acknowledgments

The author thanks Drs. Takeshi Kawabata, Ryotaro Koike, Kengo Kinoshita and Motonori Ota for their valuable comments on the first draft of this manuscript.

References

  1. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4.
  2. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5.
  3. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162:705–708. doi: 10.1016/0022-2836(82)90398-9.
  4. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press; Cambridge, U.K.: 1999.
  5. Eidhammer I, Jonassen I, Taylor WR. Protein Bioinformatics. Wiley & Sons; Chichester, England: 2004.
  6. Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA. 1987;84:4355–4358. doi: 10.1073/pnas.84.13.4355.
  7. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  8. Kinjo AR, Nakamura H. Nature of protein family signatures: insights from singular value analysis of position-specific scoring matrices. PLoS ONE. 2008;3:e1963. doi: 10.1371/journal.pone.0001963.
  9. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol. 1994;235:1501–1531. doi: 10.1006/jmbi.1994.1104.
  10. Bowie JU, Lüthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 1991;253:164–170. doi: 10.1126/science.1853201.
  11. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature. 1992;358:86–89. doi: 10.1038/358086a0.
  12. Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proc Int Conf Machine Learning. 2001:282–289.
  13. Do C, Gross S, Batzoglou S. CONTRAlign: discriminative training for protein sequence alignment. Proceedings of the Tenth Annual International Conference on Computational Molecular Biology (RECOMB 2006). 2006:160–174.
  14. DeCaprio D, Vinson JP, Pearson MD, Montgomery P, Doherty M, Galagan JE. Conrad: gene prediction using conditional random fields. Genome Res. 2007;17:1389–1398. doi: 10.1101/gr.6558107.
  15. Zhao F, Li S, Sterner BW, Xu J. Discriminative learning for protein conformation sampling. Proteins. 2008;73:228–240. doi: 10.1002/prot.22057.
  16. Finkelstein AV, Reva BA. Search for the most stable folds of protein chains. Nature. 1991;351:497–499. doi: 10.1038/351497a0.
  17. Finkelstein AV, Reva BA. Search for the most stable folds of protein chains: I. Application of a self-consistent molecular field theory to a problem of protein three-dimensional structure prediction. Protein Eng. 1996;9:387–397. doi: 10.1093/protein/9.5.387.
  18. Finkelstein AV, Reva BA. Search for the most stable folds of protein chains: II. Computation of stable architectures of beta-proteins using a self-consistent molecular field theory. Protein Eng. 1996;9:399–411. doi: 10.1093/protein/9.5.399.
  19. Kawabata T, Doi J. Improvement of protein secondary structure prediction using binary word encoding. Proteins. 1997;27:36–46. doi: 10.1002/(sici)1097-0134(199701)27:1<36::aid-prot5>3.0.co;2-l.
  20. Hirota R, Takahashi D. Discrete and Ultradiscrete Systems. Kyoritsu Shuppan Co.; Tokyo, Japan: 2003. (In Japanese).
  21. Neal RM. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer-Verlag; New York, U.S.A.: 1996.
  22. Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1995;8:999–1009. doi: 10.1093/protein/8.10.999.
  23. Miyazawa S, Jernigan RL. Identifying sequence-structure pairs undetected by sequence alignments. Protein Eng. 2000;13:459–475. doi: 10.1093/protein/13.7.459.
  24. Koike R, Kinoshita K, Kidera A. Probabilistic description of protein alignments for sequences and structures. Proteins. 2004;56:157–166. doi: 10.1002/prot.20067.
