Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2020 Nov 30;16(11):e1008085. doi: 10.1371/journal.pcbi.1008085

Remote homology search with hidden Potts models

Grey W Wilburn 1, Sean R Eddy 2,3,*
Editor: Sushmita Roy4
PMCID: PMC7728182  PMID: 33253143

Abstract

Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.

Author summary

Computational homology search and alignment tools are used to infer the functions and evolutionary histories of biological sequences. Most widely used tools for sequence homology searches, such as BLAST and HMMER, rely on primary sequence conservation alone. It should be possible to make more powerful search tools by also considering higher-order covariation patterns induced by 3D structure conservation. Recent advances in 3D protein structure prediction have used a class of statistical physics models called Potts models to infer pairwise correlation structure in multiple sequence alignments. However, Potts models assume alignments are given and cannot build new alignments, limiting their use in homology search. We have extended Potts models to include a probability model of insertion and deletion so they can be applied to sequence alignment and remote homology search using a new model we call a hidden Potts model (HPM). Tests of our prototype HPM software show promising results in initial benchmarking experiments, though more work will be needed to use HPMs in practical tools.

Introduction

An important task in bioinformatics is determining whether a new sequence of unknown biological function is evolutionarily related, or homologous, to other known sequences or families of sequences. Critical to the concept of homology is alignment: homology tools create multiple sequence alignments (MSAs) in which evolutionarily related positions are aligned in columns by inferring patterns of sequence conservation induced by complex evolutionary constraints maintaining the structure and function of the sequence [1].

Homology search tools are used in a wide range of biological problems, but it is common for these tools to fail to identify distantly related sequences; many genes evolve quickly enough that homologs may exist yet be undetectable [2]. More powerful and sensitive homology search tools are therefore needed.

One possible way to improve the sensitivity of homology search and alignment is to develop new methods that successfully capture patterns of residue correlation induced by 3D structural constraints. State-of-the-art homology search methods do not model certain important elements of structure-induced conservation. Methods such as BLAST and HMMER, the latter of which uses profile hidden Markov models (pHMMs), align and score sequences using primary sequence conservation alone [35]. Specific methods for RNA homology search, such as the software package Infernal, use a class of probabilistic models called profile stochastic context free grammars (pSCFGs) to infer primary and secondary structure conservation in RNA [6, 7]. However, Infernal is limited to nested, disjoint pairs of nucleotides, meaning it cannot capture complicated 3D RNA structural elements like pseudoknots and base triples, let alone complex correlation structure in protein MSAs.

An opportunity to build a more sensitive sequence homology search method has arisen via two key developments. First, there has been recent exciting progress in exploring pairwise correlations between columns in protein and RNA MSAs using Potts model (aka Markov random field) methods from statistical physics [814]. Previous work has applied Potts models to homology search and alignment problems with specific proteins, but not to biological sequences in general [1521]. Potts models have also been used to study protein-protein interactions [2225], mutational effects [2630], cellular morphogenesis [31], and collective neuron function [32]. Building upon previous methods that use pairwise sequence correlation to infer conserved base pairs in RNA structure and 3D structure in proteins [33, 34], a Potts model expresses the probability that a particular sequence belongs to a family represented by an MSA as a function of all possible characters (amino acids or nucleotides) at each position and all possible pairs of characters across all positions.

Second, very large datasets of aligned homologous protein and RNA sequences are available, such as Pfam for protein and Rfam for RNA [35, 36]. Deep MSAs allow for new methods that analyze more subtle patterns of sequence conservation.

Potts models have two drawbacks for applications in sequence homology search. First, Potts models assume fixed length sequences in given MSAs, treating insertions and deletions as an extra character (5th nucleotide or 21st amino acid). Homology search involves scoring variable length sequences and inferring alignments. Second, models such as pHMMs and pSCFGs use efficient dynamic programming algorithms that align and score sequences in polynomial time. Unfortunately, the all-by-all pairwise nature of Potts models makes them incompatible with dynamic programming algorithms and therefore impractical for homology search.

It would be useful to have a homology search tool that can both handle higher-order sequence correlation patterns, as a Potts model can, and efficiently align and score variable length sequences, as BLAST, HMMER, and Infernal are able to do. Inspired by early work using Potts models for protein alignment and the success of Potts models in structure prediction, we have developed a class of models called hidden Potts models (HPMs) that combines the coupled Potts emission process with a probabilistic insert-deletion model. In addition, we have developed an approximate algorithm that uses importance sampling to align and score variable length sequences to a parameterized HPM.

We have built a proof of principle HPM software implementation. To test the efficacy of HPMs in homology search and alignment, we compare our software to HMMER and Infernal in RNA remote homology benchmark tests based on trusted, hand-curated non-coding RNA alignments. RNA is an ideal testbed, as secondary and tertiary structure are largely dictated by stereotyped base-pairing interactions, making model parameterization interpretable. Also, pSCFGs capture nested pairwise correlations, a key part of RNA homology search, leading to a natural comparison to an existing approach that captures much of the pairwise residue correlation structure in conserved RNAs. As HPMs capture nested and non-nested pairwise correlations, we expect them to at least equal, if not outperform, pSCFGs in RNA remote homology search and alignment.

Results

Hidden Potts models emit variable length sequences

A generative homology model describes the probability of creating an unaligned sequence of L observed characters, x. In the case of biological sequences, characters in x represent monomer units in a specific alphabet (20 amino acids for protein, 4 nucleotides for DNA and RNA).

A Potts model expresses the probability of a homologous sequence as a function of primary conservation and all possible pairwise correlations between all consensus sites in a biological sequence (i.e., consensus columns in a multiple sequence alignment, where we define consensus columns as those having fewer than 50% gap characters). Going beyond a Potts model, a hidden Potts model consists of M sites called match states (squares in Fig 1), corresponding to well-conserved columns in an MSA, and intermediate insert states (diamonds) that handle variable-length insertions between match states. An HPM has a fixed number of match states, which restricts the model to glocal alignment (local to the sequence but global to the model).

Fig 1. Hidden Potts model architecture.

Fig 1

Squares are conserved match states and diamonds are insert states. No delete states exist. Silent begin and end states are represented by circles. An HPM is a hybrid between a Potts model and a pHMM: correlated character generation (including deletion “characters” rather than delete states) in match columns (consensus sites in an MSA) comes from a Potts distribution (dotted orange arrows), while transition probabilities linking states and site-independent character emissions in unaligned insert columns come from a pHMM (solid and dashed black arrows, respectively).

An HPM is a hybrid between a profile hidden Markov model (pHMM) [4] and a Potts model. Much like a pHMM, an HPM consists of an observed sequence x and a hidden state path σ representing the alignment of x to the model. σ is a Markov chain, with the individual states mapping to columns in a multiple sequence alignment. Each state emits an observed character, with all emissions combining to form x.

Unlike a pHMM, HPM states do not emit characters independent of other states, and deletions relative to the model’s consensus are handled differently. HPM match states generate a set of characters with a single, correlated emission from a Potts model. As Potts models can only handle fixed length sequences, HPMs do not model the lack of a character in a consensus column with a special delete state in σ, but rather as the emission of a “dummy” deletion character from a match state. The deletion character effectively serves as an extra character in the alphabet (21st amino acid or 5th nucleotide). As such, we divide x into two non-contiguous subsequences: xm, the characters emitted from match states (which may include deletion characters); and xi, the characters generated by insert states (all of which represent monomers). In this notation, m signifies “match” and i represents “insert”. A pHMM path, which we represent as π and can include delete states, is not identical to a corresponding HPM path: a single HPM path σ can map to up to 2M possible paths under a pHMM, representing the possible distribution of deletions in consensus columns.

Mathematically, σ, xm, and xi are nuisance variables that must be marginalized over to obtain P(x). The joint probability of x, xm, xi, and σ under an HPM can be factorized into two terms.

P(x,xm,xi,σ)=P(x|xm,xi,σ)P(xm,xi,σ) (1)

Given σ, xm and xi can be dealigned to produce a unique, unaligned sequence x; P(x|xm,xi,σ) is 1 if xm, xi, and σ can be combined to produce unaligned sequence x and 0 otherwise.

Each of xm, xi, and σ is generated differently under an HPM. Their joint probability can be factored into three terms: Pt(σ), a product of transition probabilities linking states to one another (solid black arrows in Fig 1); PPotts(xm), describing the Potts emission of characters from match states (dashed orange arrows); and Pi(xi|σ), a product of independent emission of characters from insert states (dashed black arrows). In this notation, t is shorthand for “transition.”

P(xm,xi,σ)=PPotts(xm)Pi(xi|σ)Pt(σ) (2)

The entire match sequence xm is generated by one multi-character emission from the Potts distribution. The Potts distribution assumes the probability of generating a sequence from match states is given by an exponential Boltzmann distribution:

PPotts(xm)=1ZeH(xm) (3)

Here H(xm) is referred to as the Hamiltonian, while Z is the partition function, a normalization constant. The Hamiltonian is given by:

H(xm)=k=1Mhk(xmk)+k=1Ml=k+1Mekl(xmk,xml) (4)

hk(xmk) represents the single position preference for character xmk in match state mk. ekl(xmk,xml) corresponds to the statistical coupling between character preferences at separate match columns k and l. These are the Potts model parameters that we estimate from an input MSA.

The insert emission probability factorizes into independent terms for each character in xi.

Pi(xi|σ)=j=1IP(xij) (5)

The probability of a state path σ is given by:

Pt(σ)=P(σ1)n=2ΛP(σn|σn-1) (6)

Here, σn is an HPM state. Λ is the total number of states in σ. Given that xm can include deletion characters, Λ is at least as large as L, the total number of non-gap characters observed in sequence x.

HPMs are generative models capable of producing any sequence, while the inclusion of possible self-transitions within insert states allows for the generation of sequences of any length. The steps to generate a sequence are:

  1. Choose the path σ by performing a random walk along the state path following Pt(σ).

  2. Choose insert state characters by independently drawing from Pi(xi|σ).

  3. Generate match column characters from a single emission from PPotts(xm).

  4. Dealign the sequence by combining xm and xi and removing deletion characters. This corresponds to choosing the unaligned sequence x for which P(x|σ,xm,xi) is 1.

Training an HPM

Our hidden Potts model design contains two independent pieces, the pHMM-like and Potts-like parts of the model, that we train separately. We train the pHMM-like elements (Pt, Pi) using the HMMER software package program hmmbuild with an MSA as input [5]. Additionally, hmmbuild annotates which columns in the input MSA correspond to consensus positions: columns with fewer than 50% gaps are deemed to be consensus. Using only the consensus columns, we then train a Potts model, PPotts, with the Gremlin structure prediction software [11]. Gremlin uses an approximate pseudolikelihood maximization method to estimate the hk and ekl Potts model terms [37]. We combine Pt, Pi, and PPotts with our own code (hpmbuild) to produce a fully parameterized HPM.

A Potts model produced by Gremlin is not normalized, as the partition function Z is not calculated. Therefore, we are only able to calculate the probability of a given xm under a Potts model up to Z.

PPotts*(xm)=ZPPotts(xm) (7)

Thus, we only know the marginal probability of a sequence under an HPM up to a factor of Z.

P*(x)=ZP(x) (8)

Implications of using unnormalized probability distributions are discussed below.

Importance sampling alignment algorithm

In order to use HPMs in remote homology search, we need algorithms to efficiently align and score sequences of varying length with a parameterized HPM. Profile HMMs and profile SCFGs use dynamic programming algorithms to optimally align and score sequences in polynomial time. However, dynamic programming algorithms do not work in models like HPMs with non-nested correlation terms (ekl’s).

Aligning a sequence consists of finding the best possible combination of σ, xm, and xi for a given sequence under an HPM.

{xm,xi,σ}ali=argmax{xm,xi,σ}P*(x,xm,xi,σ) (9)

Scoring a sequence entails summing the unnormalized joint probabilities for a given sequence x and the set of all possible combinations of σ, xm, and xi to find the unnormalized probability that the HPM generated the sequence, P*(x).

P*(x)={σ,xm,xi}P*(x,xm,xi,σ) (10)

Evaluating all possible combinations of σ, xm, and xi by brute force enumeration is computationally intractable, so we need an approximate method to align and score sequences efficiently. One could imagine using Monte Carlo integration, randomly sampling a finite number of combinations from P*(xm,xi,σ) and approximating the marginalization sum above to obtain P*(x). However, the space of possible combinations is enormous, and few of the sampled combinations would yield non-negligible probabilities P*(x,xm,xi,σ).

To more efficiently score and align sequences, we use a related method, importance sampling, illustrated in Fig 2 and described in detail in S1 Appendix. Under importance sampling, instead of sampling xm, xi, and σ from the prior distribution P(xm,xi,σ), we instead sample a smaller number of paths from a different distribution, the “proposal” distribution, Q(xm,xi,σ), in which the probability mass for all three variables is much more concentrated.

Fig 2. Importance sampling alignment algorithm schematic.

Fig 2

Toy example of aligning an RNA sequence CUCUGGAAG to models of its sequence/structure consensus, where the true structural alignment has two nested base pairs (brackets in structure line) and one pseudoknot (‘Aa’). Suboptimal alignments of the sequence are sampled probabilistically using a pHMM. A pHMM does not capture residue correlations due to base-pairing, so only some proposed alignments satisfy the expected consensus nested (cyan) or pseudoknotted (pink) base pairs. The proposed alignments are re-scored and re-ranked under the HPM, which does capture correlation structure; the correct alignment with the highest probability under the HPM is identified (green), and the sequence’s total probability P(x) is obtained by importance-weighted summation over the sampled alignments.

Using importance sampling, for R samples, P*(x) is approximated by:

P*(x)1Rr=1RP(x|xm(r),xi(r),σ(r))P*(xm(r),xi(r),σ(r))Q(xm(r),xi(r),σ(r)) (11)

The approximate alignment is the sampled combination of σ, xm, and xi that maximizes the numerator of the importance sampling sum.

For a proposal distribution, we use the posterior probability of an HPM path, match sequence, and insert sequence given an unaligned sequence under a pHMM, PHMM(xm,xi,σ|x). We show in S1 Appendix that PHMM(xm,xi,σ|x) is equivalent to PHMM(π|x), where π is a pHMM-style path with delete states. The PHMM(π|x) posterior distribution over all possible alignments of a sequence to a pHMM can be sampled exactly and efficiently by stochastic tracebacks of the Forward dynamic programming matrix [1, 38], requiring one O(ML) time calculation of the Forward matrix followed by an O(L) time stochastic traceback per sampled alignment. (Recall M is the number of match states in the HPM and L is the number of non-gap characters in unaligned sequence x.) Re-scoring each sampled alignment with the HPM takes O(M2+L) time, where the most expensive step is evaluating all-vs-all pairwise ekl terms in the Potts emission of xm. Overall, for a proposed sample of R alignments, our approach takes O(ML+RL+RM2) time, typically dominated by the RM2 term.

Intuitively, our approach assumes that a pHMM has the alignment reasonably well constrained, though not correct in detail, which is our experience with alignments of remotely homologous sequences that fall near detection thresholds. Because the pHMM neglects residue correlations, the alignment that best satisfies conserved correlation patterns is often not the optimal pHMM alignment. So long as this alignment is probable enough to be present in a large stochastic sample from the pHMM’s distribution over alignment uncertainty, rescoring that sample with the more complex HPM will find it.

We have taken 106 samples from PHMM(π|x) per sequence when aligning and scoring, which takes roughly 1 minute per sequence for typical M and L. Our tests indicate that the importance sampling approximation for P*(x) generally converges by 106 samples, and the optimal alignment is generally stable (see S1 Appendix for more information).

To score homology searches, we use a standard log odds score, in which we compare P*(x) to the probability that x was generated by a null model, PN(x) [1].

S*(x)=log(P*(x)PN(x)) (12)

Ideally, we would use a normalized HPM distribution P(x), which requires knowing the partition function, to produce a normalized log odds score S(x). Calculating Z exactly using all possible match state subsequences is intractable. As Z is a constant for a particular model, log odds scores can be compared relatively across different searches with a given query HPM, but not qualitatively across different query HPMs (with different unknown Z’s). However, knowing Z and having normalized log odds scores would not give us statistical significance values (E- and p-values), which also depend on the size of the target database. For future work, both the partition function and statistical significance will have to be evaluated empirically. We can still use the importance sampling alignment algorithm to find a maximum likelihood alignment of a target sequence to a query HPM or to calculate the posterior probabilities of alignments of target sequences to an HPM, tasks for which Z cancels and statistical significance estimations are not strictly needed.

HPMs perform promisingly

To test the performance of HPMs in remote homology search and alignment, we designed benchmark experiments using a few deep, well-curated RNA alignments. We choose to test on RNA alignments for three reasons. First, secondary and tertiary structure in RNA is largely dictated by nucleotide base-pairing, which makes the ekl terms in the HPM easily interpretable. Second, pSCFG methods already use RNA secondary structure conservation patterns, so we can compare the performance of HPMs not just to sequence-only models like pHMMs, but also to existing models that already capture most RNA pairwise correlation structure. Finally, because the residue correlation structure in RNAs largely consists of disjoint base pairs, we can make some back-of-the-envelope observations about how much additional information content an HPM will capture compared to a pSCFG or a pHMM. This estimation serves as a gauge of how much we might be able to improve homology search and alignment, as discussed next.

Information content of RNA multiple sequence alignments

We can get a sense of how much additional statistical signal is capturable by HPMs compared to pSCFGs and pHMMs from the Shannon relative entropies (information content) of single MSA columns and column pairs. The relative entropy of a residue emission probability distribution for a single position [39] is roughly the expected score per position in a pHMM. The difference between the relative entropy of the joint emission distribution for two positions and the sum of their individual relative entropies is the mutual information, the extra information (in bits) gained by treating the two positions as a correlated pair. We define secondary structure information content as the summed mutual information of all nested base pairs in an RNA consensus structure; this is a measure of how much statistical signal a pSCFG gains over a pHMM for an RNA alignment. Similarly, we can define “tertiary structure” information content as the summed mutual information of all other disjoint base pairs in an RNA consensus structure, such as pseudoknots, not included in a nested pSCFG description. This is not a complete picture of the statistical signal capturable by an HPM, because the measure neglects non-disjoint pairs (e.g. base triples) and other higher-order correlations that an HPM can also capture. However, RNA pseudoknots are one of the most important features we aim to capture in more powerful sequence homology search and alignment models.

As shown in Fig 3, most of the information content in Rfam RNA MSAs results from primary and secondary structure conservation. Tertiary structure information content contributes to a lesser degree. These results indicate that HPMs have the potential to only slightly gain sensitivity over pSCFGs in RNA homology search. However, when trying to improve the sensitivity of homology search, even small increases in signal are potentially useful; for instance, a 2-bit increase in log odds score corresponds to a 4-fold decrease in E-value [40].

Fig 3. Additional information contained in pairwise covariation in RNA MSAs.

Fig 3

For each of the 127 Rfam 14.1 seed MSAs with more than 100 sequences [36], we infer a predicted consensus structure (including nested and non-nested base pairs) using CaCoFold and R-scape [41, 42]. (A) Primary information content (X axis) versus the sum of primary and secondary information content (Y axis). (B) Sum of primary and secondary information content (X axis) versus the information content from all three levels of structure (Y axis) (C) Secondary information content (X axis) versus the sum of information content from secondary and tertiary structure (Y axis). Not included in (C) is ROOL (RF03087), with 305.3 bits of primary information, 174.5 bits of secondary information, and 10.7 bits of tertiary information.

Benchmark procedure

We choose to use a small number of well-studied sequence families of known 3D structure, as opposed to larger and more systematic benchmarks, in order to be able to study individual alignment results in detail. We use three RNA alignments for these experiments: an alignment of 1415 transfer RNA (tRNA) sequences [43], a 1446-sequence Twister type P1 ribozyme MSA [44], and the 433-sequence SAM riboswitch seed alignment from Rfam family RF00162 [36]. Summary statistics for each benchmark dataset are shown in Table 1. Each training alignment contains at most a small amount of information content resulting from pairwise covariation induced by conserved tertiary structure.

Table 1. Benchmark dataset statistics and training alignment information content.
Family Ntrain Ntest Consensus Length 1° info (bits) 2° info (bits) 3° info (bits)
tRNA 1357 30 72 40.3 26.7 1.2
Twister 1005 40 65 57.5 4.5 1.3
SAM 192 8 108 121.8 11.1 0.0

Hand-curated MSAs are split into training and test sets based on [45]. For each training MSA, information content in the primary sequence (in bits) is calculated [39], while information in secondary structure (nested base pairs) and tertiary structure (all other disjoint pairwise interactions between sites) is estimated using mutual information [6]. Each family’s consensus structure is inferred using CaCoFold and R-scape on the training alignment [41, 42]. Though R-Scape identifies no tertiary structure using the SAM riboswitch training alignment, a four-base pair pseudoknot has been observed experimentally [46]. This lack of pseudoknot detection is a characteristic of our SAM training alignment; R-scape predicts the pseudoknot when analyzing the RF00162 seed alignment.

To create benchmark datasets, we divide each reference MSA into an in-clade training set and a remotely homologous test set, as we have done in previous work on RNA homology search [45]. To create a challenging test and simulate a search for very distantly related sequences, we cluster sequences by single linkage clustering and then split the clusters so that the following conditions are satisfied:

  • No test sequence is more than 60% identical to any training sequence.

  • No two test sequences are more than 70% identical.

For scoring benchmarks, we add 200,000 randomly-generated, non-homologous decoy sequences to the test set. We use synthetic sequences in order to avoid penalizing a method that identifies remote, previously unknown evolutionary relationships. Decoys are created with characters drawn i.i.d from the nucleotide composition of the positive test sequences, with the length of each decoy matching a randomly-selected positive test sequence. We score all of the test and decoy sequences with homology models built using the training MSAs. We measure scoring effectiveness by creating a ROC plot that shows true positive rate, or sensitivity, as a function of false positive rate.

To test the effectiveness of HPMs in remote homology alignment, we measure how accurately each model aligns remote homologs relative to the original alignment. We define accuracy to be the fraction of residues in consensus columns aligned correctly across all test sequences. We do not measure alignment accuracy in insert columns, as pHMMs, pSCFGs, and HPMs do not align characters in insert states.

Scoring and alignment benchmark results

The chief motivation in using HPMs for homology search is the potential for conserved patterns of pairwise evolution to be captured by the all-by-all pairwise ekl terms. Are the ekl’s useful for homology search? To answer this question, we compare the performance of default HPMs with all ekl’s included (red curves in Fig 4) against HPMs for which the ekl terms are constrained to 0 when training with Gremlin (“no ekl” HPMs, black curves). A no ekl HPM is similar to a pHMM, but there three key differences: gaps are represented by deletion characters rather than deletion states; sequences are aligned using importance sampling rather than the Viterbi algorithm; and Gremlin uses different prior information than HMMER. The HPM outperforms the no ekl HPM in both the tRNA and Twister ribozyme benchmarks, showing greater sensitivity in the ROC plots in Fig 4. (In the SAM riboswitch benchmark, every method tested is able to fully separate homologous test and decoy sequences.) In the three alignment benchmarks (see Table 2), the all-by-all HPM is more accurate than the no ekl HPM. We conclude ekl’s generally add sensitivity to remote homology search and alignment.

Fig 4. Remote homology scoring benchmark results.

Fig 4

Receiver operating characteristic (ROC) plots for the tRNA (A) and Twister type P1 ribozyme (B) benchmarks. The X axis is the fraction of decoys that score higher than a certain threshold (false positive rate), and the Y axis is the fraction of homologous test sequences that score higher than the same threshold (true positive rate, or sensitivity). In the SAM riboswitch benchmark, each model perfectly discriminates homologs from decoys.

Table 2. Remote homology alignment benchmark results.
Model Total Accuracy Nested Base Pairs Pseudoknotted Base Pairs Other Positions
tRNA
HPM 0.863 0.871 N/A 0.850
No ekl HPM 0.700 0.696 N/A 0.706
HMMER 0.641 0.642 N/A 0.639
Infernal 0.901 0.938 N/A 0.843
Nested HPM 0.879 0.900 N/A 0.846
Twister Ribozyme
HPM 0.762 0.762 0.940 0.669
No ekl HPM 0.622 0.592 0.848 0.550
HMMER 0.683 0.675 0.873 0.597
Infernal 0.718 0.766 0.880 0.563
Nested HPM 0.784 0.782 0.980 0.686
Pseudoknotted HPM 0.794 0.797 0.980 0.694
SAM Riboswitch
HPM 0.748 0.753 0.918 0.711
No ekl HPM 0.733 0.737 0.918 0.693
HMMER 0.847 0.851 0.984 0.817
Infernal 0.977 0.993 0.984 0.956
Nested HPM 0.911 0.930 0.984 0.873
Pseudoknotted HPM 0.894 0.916 0.918 0.861

Total alignment accuracy is measured as the fraction of residues in consensus columns aligned correctly across all test sequences, relative to the reference alignment. Accuracy across consensus columns with different types of secondary structure annotation is also included.

But does our proof of principle HPM implementation outperform existing methods? We compare the performance of the HPM to HMMER and Infernal, pHMM and pSCFG homology search tools, respectively [7, 47] (gray and blue lines in Fig 4). The HPM generally outperforms HMMER, only falling short in the SAM riboswitch alignment benchmark. Nevertheless, Infernal is usually more sensitive than the HPM, with the only exception being the Twister ribozyme alignment benchmark.

Why is our HPM implementation not outperforming Infernal? One difference between the two methods is model parameterization. Maximum a posteriori pSCFG parameters are calculated analytically from weighted residue frequencies and priors, and Infernal’s choices for weighting and priors have been optimized for homology search and alignment over decades. In contrast, pseudolikelihood maximization of Potts model parameters is an approximation, and it has not previously been used to train homology search and alignment models. To gain more insight, we bypassed pseudolikelihood maximization by creating “masked” HPMs with non-zero ekl terms restricted to annotated nested base pairs (“nested HPM”, cyan line in Fig 4) or to all annotated disjoint base pairs (“pseudoknotted” HPM, yellow line). For a model with a correlation structure consisting solely of disjoint base pairs, maximum a posteriori Potts model parameters can be estimated directly and analytically like pSCFG parameters by setting ekl terms to negative log joint probabilities and hk, hl terms to zero for k, l pairs. These “masked” HPMs were almost always more sensitive than the original HPM, with performance closer to Infernal’s. This result suggests using the pseudolikelihood approximation to train a Potts model is one of the obstacles to using HPMs for homology search.

The improved performance of the masked HPMs relative to the default HPM prompted us to look further into the HPM’s Potts model parameterization. We examine pairwise probabilities of nucleotides under the HPM at annotated base pairs, estimated using Markov chain Monte Carlo. These probabilities do not reflect the observed pairwise frequencies in the training MSA. For instance, Fig 5 shows disagreement between HPM probabilities and training MSA frequencies for an annotated base pair in the SAM riboswitch training MSA. We see many such examples in the SAM riboswitch example.

Fig 5. Hidden Potts model emission probabilities do not match training alignment statistics.

Fig 5

(A) Rfam consensus structure for the class I SAM riboswitch. Base pairs supported by statistically significant covariation in analysis by R-scape in the RF00162 seed alignment are shaded green [41]. (B) Observed pairwise nucleotide frequencies for one base pair in the P3 stem (sites 52 and 62) in our RF00162 benchmark training alignment (192 sequences). (C) Pairwise nucleotide probabilities at sites 52 and 62 under the RF00162 training HPM. (D) Pairwise nucleotide probabilities at sites 52 and 62 under the RF00162 training Infernal pSCFG. Infernal’s informative priors lead to a UG base pair being given significant probability despite being rarely seen in the training data. However, as U is much more common than G at site 52 and vice versa at site 62, Infernal assigns a higher probably to UG than GU at these sites.

However, the Gremlin-trained Potts model is accurate in its intended purpose: predicting molecular structure from aligned sequence data. For all three benchmarks, Gremlin’s predicted contacts heavily overlap with annotated base pairs or other observed structural interactions (Fig 6).

Fig 6. Gremlin-trained Potts models accurately predict 3D RNA contacts.

Fig 6

Top M2 predicted 3D contacts from Gremlin-trained Potts models (cyan) against annotated nested base pairs (gray) and annotated tertiary contacts/pseudoknotted positions (pink) from the tRNA (A), Twister type P1 ribozyme (B), and SAM riboswitch (C) training alignments. For tRNA, annotated tertiary contacts are from a yeast tRNA-phe crystal structure [48]. A non-structural covariation between the tRNA discriminator base and the middle nucleotide in the anticodon (caused by aminoacyl-tRNA synthetase recognition preferences, not conserved base-pairing) is noted in orange [49].

Characterizing HPM parameterization issues with simulated data

As an additional check on our results and parameterization issues, we performed positive controls on synthetic data, using a synthetic multiple sequence alignment sampled from an HPM with known parameters. This procedure ensures both the HPM parameters and the sequence alignment are ground truth. To execute these experiments with realistic parameters (as opposed to random), we use a multi-step procedure starting from a real input MSA, the Twister Ribozyme benchmark training MSA (here called “MSA1”). We train an HPM on MSA1 to obtain the “training HPM” with ground truth parameters. We generate 1000 independent aligned sequences from the training HPM by Markov chain Monte Carlo (MCMC); this alignment is our ground truth MSA (“MSA2”). Now we train a new HPM on MSA2 (the “synthetic HPM”). We use this experiment to test MCMC convergence, alignment accuracy, and parameter fitting. Aligning MSA2 sequences to the synthetic HPM assesses alignment accuracy. Beyond comparing raw Potts model parameters, we can also assess how well HPM parameterization matches marginal single-site residue probabilities, marginal pairwise residue probabilities, and mutual information (pairwise correlation). We perform this comparison by generating a third alignment of 1000 sequences (“MSA3”) from the synthetic HPM and comparing MSA2 versus MSA3 statistics.

This positive control yields mixed results. We observe adequate convergence of our MCMC sampling of sequences (Fig 7A) and HPM alignment accuracy under ideal synthetic conditions (alignment accuracy of 0.975, compared to 0.972 with Infernal). However, the most striking result from these experiments reveals how pseudolikelihood parameterization tends to suppress and underestimate pairwise correlations. The synthetic HPM’s Potts parameters are flattened relative to the training HPM, especially the pairwise ekl terms (Fig 7B and 7C). Although marginal single-site and pairwise residue probabilities are reproduced reasonably accurately (Fig 7D and 7E), pairwise correlation (mutual information) is systematically underestimated (Fig 7F). We initially found this result surprising and counterintuitive, since mutual information is calculated from the single-site and pairwise probabilities. However, we find that the errors in estimating marginal single-site and pairwise probabilities are correlated, not independent, and in the direction of suppressing pairwise correlation.

Fig 7. Pseudolikelihood maximization does not fully recapitulate the pairwise correlation structure of input alignments.

Fig 7

(A) Average value of the Potts Hamiltonian at intermediate iterations of the Markov chain Monte Carlo sequence emission algorithm across 1000 synthetic sequences. Error bars represent standard deviation. We accept sequences after 5000 iterations. (B-C) Comparison of single-site (A) and pairwise (B) Potts model terms of a ground truth HPM (“training HPM”) versus in an HPM trained on aligned sequences emitted from the ground truth HPM (“synthetic HPM”). (D-F) Comparisons of marginal single-site probabilities (D), marginal pairwise probabilities (E), and mutual information (F) for the posterior sequence distributions sampled from the training HPM (statistics estimated from MSA2) against the synthetic HPM (statistics estimated from MSA3).

Since the goal of using HPMs is to capture pairwise correlation structure in sequence alignments, the failure of pseudolikelihood maximization training to accurately reproduce pairwise mutual information is a major issue for remote homology search, and it may explain why HPMs are not performing better than Infernal in our benchmarks. Others have previously shown that training Potts models with pseudolikelihood maximization does not accurately reproduce training alignment statistics even when estimating 3D contacts accurately [50, 51]. In a way, we find this result encouraging; as we discuss below, there are alternative training methods to the commonly used pseudolikelihood maximization approach.

Discussion

We present hidden Potts models, generative probability models for sequence homology search and alignment. An HPM is a hybrid between a Potts model, successfully used in molecular structure prediction, and a profile HMM, used in protein and nucleic acid homology search and alignment. The hybrid model configuration uses advantages inherent to both models. By using a pHMM’s transition architecture, an HPM can model variable length sequences, a feature previously incompatible with Potts models. The all-by-all pairwise ekl terms in an HPM’s emission process, as in a Potts model, allow for non-nested structural interactions like pseudoknots to be captured, going beyond the primary-sequence capability of a pHMM and the nested base pair modeling of a pSCFG. Finally, we have developed an efficient algorithm that uses importance sampling to align and score variable length sequences to an HPM.

In initial benchmarking of RNA remote homology search and alignment, we found that HPMs perform promisingly. In experiments where we compare HPMs to HPMs with pairwise ekl terms forced to zero in training (thus a level comparison, within the same HPM implementation, of using versus not using correlation information), fully parameterized HPMs produce more accurate alignments and are generally more sensitive for homology searches. HPMs also generally outperformed HMMER, an existing pHMM method. This indicates that hidden Potts models capture useful conserved higher-order correlation structure information in an alignment-capable model.

On the other hand, our pilot implementation of HPMs generally underperforms Infernal, an existing pSCFG implementation for capturing nested pairwise correlations in conserved RNA secondary structure. This suggests to us that the gains from being able to model pseudoknots and other non-nested RNA correlations are outweighed by deficiencies in our pilot implementation. Our additional experiments implicate model parameterization by pseudolikelihood maximization as a key issue. When we bypass pseudolikelihood maximization by using “masked” HPMs limited to nonzero maximum likelihood ekl’s only for a disjoint set of base pairs, HPM performance tends much closer to Infernal performance. However, there are other Potts model training methods besides pseudolikeihood maximization, as reviewed in [51]. Additionally, we are not yet using any informative prior distributions in the Potts parameterization, comparable to the use of informative mixture Dirichlet priors in pHMMs and pSCFGs.

Besides improved parameterization, there are other issues to address to make HPMs the basis for useful homology search and alignment software tools. First, while the importance sampling alignment algorithm is efficient relative to directly enumerating over all possible alignments, it still takes seconds to minutes to align a single sequence. The method will need to be greatly accelerated. Second, while HPMs are able to handle insertions relative to a consensus with a standard affine gap-open, gap-extend probability, they do not explicitly model deletions. It would be desirable to find a cleaner model of deletions. Other work has attempted to combine Potts models with insertions and deletions using techniques based in statistical physics, though this method has not yet been applied to remote homology search and alignment [52]. Third, we only do “glocal” alignment, where a complete match to the HPM consensus model is identified in a possibly longer target sequence. We do not yet see how to do local sequence alignment to an HPM.

Future work could better characterize the performance of Potts model based methods in remote homology search. In [53], Haldane and Levy study how training alignment depth affects Potts model protein structure and fitness prediction, but no work has studied how the number of sequences affects sensitivity in remote homology search with RNA. In addition, while we use synthetic decoys with characters drawn i.i.d. to benchmark our proof-of-principle HPM implementation for simplicity, we have previously benchmarked more mature homology search methods with decoys that more closely resemble real biological sequences [47]. Further work could use synthetic decoys generated in a different manner by shuffling real biological sequences.

Another recent paper uses a different method to align protein and RNA sequences to Potts models [54]. DCAlign, developed by Muntoni et al., uses an algorithm based on message passing rather than importance sampling. They report promising results when comparing alignment accuracy to HMMER and Infernal. However, they do not attempt to score remotely homologous sequences.

Although we chose to do our initial testing with RNA sequence alignments, protein sequence homology search and alignment will also be of interest in future work. Unlike the case for pSCFGs for RNA, the state of the art for protein sequence homology search and alignment remains primary sequence methods such as HMMER and BLAST. Evolutionarily conserved protein structure creates a complex correlation structure unlike the simpler pairwise correlation patterns that dominate RNA analysis; pairwise correlations in protein alignments are difficult to treat by anything simpler than an all-by-all network in which any pair of sites is potentially correlated, making protein sequence analysis ripe for Potts models.

Materials and methods

HPM software implementation

Our prototype HPM software implementation builds hidden Potts models from multiple sequence alignments, aligns and scores sequences with an HPM using importance sampling, and generates sequences from an HPM using Markov chain Monte Carlo. The software uses functions from HMMER version 4 (in development, see https://github.com/EddyRivasLab/hmmer/tree/h4-develop) and the Easel sequence library (see https://github.com/EddyRivasLab/easel).

To train Potts models, we use a modified version of the Gremlin C++ software package [11], included with our HPM implementation. Code is available at https://github.com/gwwilburn/WilburnEddy20_HPM. A tarball of our software is also included in S1 Code.

HPM training

HPMs are trained by our program hpmbuild. Pt transition parameters and Pi insert emission parameters are taken from a pHMM trained by hmmbuild (HMMER v4). PPotts emission parameters are taken from a model trained by Gremlin C++ (version 1.0) modified to use Henikoff position-based sequence weights in the training procedure [55], the same relative weighting scheme used by HMMER.

Masked HPM training

So-called masked HPMs, with only a disjoint subset of ekl terms allowed to be non-zero, are trained by our program hpmbuild_masked. Annotated base pairs are determined by the secondary structure consensus line in an input Stockholm MSA file. Potts model parameters are negative log likelihoods estimated from weighted counts with a + 1 Laplace pseudocount. Henikoff position-based sequence weights are used [55]. For sites belonging to annotated base pairs, ekl’s are trained using the weighted pairwise nucleotide frequencies in the input MSA; these sites’ hk terms set to zero. For unpaired sites, hk terms are trained using the weighted single-site nucleotide frequencies, and the corresponding ekl’s are set to zero.

Aligning and scoring sequences with an HPM

Sequences are aligned to and scored by an HPM with our program hpmscoreIS. This program takes an HPM, a pHMM with the same number of match states, and a sequence file as input. It returns the unnormalized log odds scores for each sequence estimated using importance sampling with the pHMM as a proposal distribution. Additionally, aligned sequences are outputted to a Stockholm MSA file. For each input sequence, 1 million alignments are sampled from the pHMM.

Estimation of single-site and pairwise probabilities under an HPM

Probabilities are estimated empirically from 10,000 synthetic sequences emitted from an HPM using our program hpmemit.

We use Markov chain Monte Carlo to emit match sequence xm from the Potts model. Each sequence is generated independently using the Metropolis-Hastings algorithm with a burn-in period of 5000 steps.

To create the HPM path σ, we perform a random walk through the HPM’s state transitions. Given σ, we emit insertion sequence xi by drawing residues independently from Pi(xi|σ).

Experiment with simulated sequences

1000 synthetic sequences are emitted from the Twister ribozyme benchmark training HPM using our program hpmemit. Each sequence is generated independently using the Metropolis-Hastings algorithm with a burn-in period of 5000 steps.

An HPM is then trained on the resulting synthetic alignment using hmmbuild, Gremlin, and hpmbuild. This model’s single-site and pairwise frequencies are estimated empirically as before using sequences generated by hpmemit.

The other homology models are then trained on the synthetic MSA. When training with HMMER and Infernal, entropy weighting is turned off using hmmbuild --enone and cmbuild --enone, respectively. As we emit sequences from a glocal HPM, we aslo build the CM such that the flanking insert transitions are learned using the --iflank option.

For alignment accuracy tests, the synthetic sequences are then “realigned” to the models trained on the synthetic MSA.

Step-by-step instructions for recreating the results with synthetic sequences are included with S1 Code.

Software and database versions used

For building pHMMs, we use HMMER 4 (in development). The code is available at https://github.com/EddyRivasLab/hmmer, inside the branch h4-develop. The specific version of code we use is archived under tag 7994cc7 at https://github.com/EddyRivasLab/hmmer/commits/h4-develop. The software is also included in S1 Code.

We use HMMER 4 functions to create programs hmmalign_uniglocal and hmmscore_uniglocal that glocally score and align sequences with pHMMs, respectively. These programs are included in the HPM prototype software implementation.

For analysis with pSCFGs, we use Infernal version 1.1.3 [7]. The code is available at http://eddylab.org/infernal/ and included with S1 Code.

Rfam version 14.1 is used for the SAM riboswitch anecdote [36].

For consensus RNA structure prediction by comparative sequence analysis, we used CaCoFold in version 1.4.0 of the R-scape software package [41, 42].

Benchmark datasets

Training and test alignments used for each anecdote are included as supplementary material with S1 Code. Benchmark procedures follow methods in [45].

Analysis of alignments

RNA multiple sequence alignments were visualized using the Ralee RNA alignment editor in Emacs [56].

Supporting information

S1 Appendix. Derivation of importance sampling alignment algorithm (PDF).

(PDF)

S1 Code. Contains the HPM software implementation, versions of HMMER and Infernal used in benchmarking, the Easel sequence library, and benchmark datasets (ZIP).

(GZ)

Acknowledgments

We thank Elena Rivas, Tom Jones, Nick Carter, and Sergey Ovchinnikov for insightful discussions throughout the project. Additionally, we thank Eric Nawrocki and members of the Eddy and Rivas labs for comments on the manuscript. Some ideas in this work were conceived at workshops hosted at the Centro de Ciencias de Benasque Pedro Pascual (Benasque, Spain), and the Aspen Center for Physics. Computations were run on the Cannon cluster, supported by the Harvard FAS Division of Science’s Research Computing Group.

Data Availability

All relevant data are within the manuscript and its Supporting information files. Additionally, all relevant data will be posted publicly without limitations on http://eddylab.org/publications.html.

Funding Statement

SRE and GWW are funded by the Howard Hughes Medical Institute. (https://www.hhmi.org/). SRE is a Howard Hughes Medical Institute Investigator (grant number n/a). SRE is funded by National Human Genome Research Institute of the National Institutes of Health award R01-HG009116 (https://www.genome.gov/). Some ideas in this work were conceived at workshops hosted at the Centro de Ciencias de Benasque Pedro Pascual in Benasque, Spain (http://www.benasque.org/), and the Aspen Center for Physics, supported by National Science Foundation grant PHY-1066293 (https://www.nsf.gov/). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Durbin R, Eddy SR, Krogh A, Mitchison GJ. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge UK: Cambridge University Press; 1998. [Google Scholar]
  • 2.Weisman CM, Murray AW, Eddy SR. Many but Not All Lineage-Specific Genes Can Be Explained by Homology Detection Failure. biorXiv 968420v2 [Preprint]. 2020 [Cited 11 June 2020]. Available from: https://www.biorxiv.org/content/10.1101/2020.02.27.968420v2 [DOI] [PMC free article] [PubMed]
  • 3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol. 1990;215:403–410. 10.1016/S0022-2836(05)80360-2 [DOI] [PubMed] [Google Scholar]
  • 4.Haussler D, Krogh A, Mian IS, Sjolander K. Protein Modeling Using Hidden Markov Models: Analysis of Globins. In: Proceedings of the Twenty-Sixth Hawaii International Conference on System Sciences; 1993. p. 792–802.
  • 5. Eddy SR. Profile Hidden Markov Models. Bioinformatics. 1998;14:755–763. 10.1093/bioinformatics/14.9.755 [DOI] [PubMed] [Google Scholar]
  • 6. Eddy SR, Durbin R. RNA Sequence Analysis Using Covariance Models. Nucl Acids Res. 1994;22:2079–2088. 10.1093/nar/22.11.2079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold Faster RNA Homology Searches. Bioinformatics. 2013;29:2933–2935. 10.1093/bioinformatics/btt509 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Lapedes AS, Giraud BG, Liu LC, Stormo GD. A Maximum Entropy Formalism for Disentangling Chains of Correlated Sequence Positions. Lecture Notes-Monograph Series, Statistics in Molecular Biology and Genetics. 1999;33:236–256. [Google Scholar]
  • 9. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of Direct Residue Contacts in Protein–Protein Interaction by Message Passing. Proc Natl Acad Sci USA. 2009;106:67–72. 10.1073/pnas.0805923106 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-Coupling Analysis of Residue Coevolution Captures Native Contacts Across Many Protein Families. Proc Natl Acad Sci USA. 2011;108:E1293–E1301. 10.1073/pnas.1111471108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Kamisetty H, Ovchinnikov S, Baker D. Assessing the Utility of Coevolution-based Residue–Residue Contact Predictions in a Sequence-and Structure-Rich Era. Proc Natl Acad Sci USA. 2013;110(39):15674–15679. 10.1073/pnas.1314045110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved Contact Prediction in Proteins: Using Pseudolikelihoods to Infer Potts Models. Physical Review E. 2013;87(1):012707 10.1103/PhysRevE.87.012707 [DOI] [PubMed] [Google Scholar]
  • 13. De Leonardis E, Lutz B, Ratz S, Cocco S, Monasson R, Schug A, et al. Direct-Coupling Analysis of Nucleotide Coevolution Facilitates RNA Secondary and Tertiary Structure Prediction. Nucl Acids Res. 2015;43:10444–10455. 10.1093/nar/gkv932 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Weinreb C, Riesselman AJ, Ingraham JB, Gross T, Sander C, Marks DS. 3D RNA and Functional Interactions from Evolutionary Couplings. Cell. 2016;165:963–975. 10.1016/j.cell.2016.03.030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. White JV, Muchnik I, Smith TF. Modeling Protein Cores with Markov Random Fields. Math Biosci. 1994;124:149–179. 10.1016/0025-5564(94)90041-8 [DOI] [PubMed] [Google Scholar]
  • 16. Lathrop RH, Smith TF. Global Optimum Protein Threading with Gapped Alignment and Empirical Pair Score Functions. J Mol Biol. 1996;255:641–645. 10.1006/jmbi.1996.0053 [DOI] [PubMed] [Google Scholar]
  • 17. Thomas J, Ramakrishnan N, Bailey-Kellogg C. Graphical Models of Residue Coupling in Protein Families. IEEE/ACM Trans Comp Biol Bioinf. 2008;5:183–197. 10.1109/TCBB.2007.70225 [DOI] [PubMed] [Google Scholar]
  • 18. Liu Y, Carbonell J, Gopalakrishnan V, Weigele P. Conditional Graphical Models for Protein Structural Motif Recognition. J Comput Biol. 1996;255:641–645. [DOI] [PubMed] [Google Scholar]
  • 19. Menke M, Berger B, Cowen L. Markov Random Fields Reveal an N-Terminal Double Beta-Propeller Motif as Part of a Bacterial Hybrid Two-Component Sensor System. Proc Natl Acad Sci USA. 2010;107:4069–4074. 10.1073/pnas.0909950107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Peng J, Xu J. A Multiple-Template Approach to Protein Threading. Proteins. 2010;79:1930–1939. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Daniels NM, Hosur R, Berger B, Cowen L. SMURFLite: Combining Simplified Markov Random Fields with Simulated Evolution Improves Remote Homology Detection for Beta-Structural Proteins into the Twilight Zone. Bioinformatics. 2012;28:1216–1222. 10.1093/bioinformatics/bts110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Ovchinnikov S, Kamisetty H, Baker D. Robust and Accurate Prediction of Residue-Residue Interactions across Protein Interfaces Using Evolutionary Information. eLife. 2014;113:e02030. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Bitbol AF, Dwyer RS, Colwell LJ, Wintergreen NS. Inferring Interaction Partners from Protein Sequences. Proc Natl Acad Sci USA. 2016;106:67–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Gueudre T, Baldassi C, Zamparo M, Weigt M, Pagnani A. Simultaneous Identification of Specifically Interacting Paralogs and Interprotein Contacts by Direct Coupling Analysis. Proc Natl Acad Sci USA. 2016;113:12185–12191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Cong Q, Anishchenko I, Ovchinnikov S, Baker D. Protein Interaction Networks Revealed by Proteome Coevolution. Science. 2019;365:185–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Cheng RR, Nordesjo O, Hayes RL, Levine H, Flores SC, Onuchic JN, et al. Connecting the Sequence-Space of Bacterial Signaling Proteins to Phenotypes Using Coevolutionary Landscapes. Mol Biol Evol. 2016;33:3054–3064. 10.1093/molbev/msw188 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1. Mol Biol Evol. 2016;33:268–280. 10.1093/molbev/msv211 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Levy RM, Haldane A, Flynn WF. Potts Hamiltonian Models of Protein Co-variation, Free Energy Landscapes, and Evolutionary Fitness. Curr Opin Struct Biol. 2017;43:55–62. 10.1016/j.sbi.2016.11.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Hopf TA, Ingraham JB, Poelwijk FJ, Sharfe CP, Springer M, Sander C, et al. Mutation Effects Predicted from Sequence Co-variation. Nature Biotechnology. 2017;35:128–135. 10.1038/nbt.3769 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Salinas VH, Ranganathan R. Coevolution-Based Inference of Amino Acid Interactions Underlying Protein Function. eLife. 2018;7:e34300 10.7554/eLife.34300 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Graner F, Glazier JA. Simulation of Biological Cell Sorting Using a Two-Dimensional Extended Potts Model. Physical Review Letters. 1992;69:2013–2016. 10.1103/PhysRevLett.69.2013 [DOI] [PubMed] [Google Scholar]
  • 32. Schneidmann E, II MJB, Segev R, Bialek W. Weak Pairwise Correlations Imply Strongly Correlated Network States in a Neural Population. Nature. 2007;440:1007–1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Chiu DKY, Kolodziejczak T. Inferring Consensus Structure from Nucleic Acid Sequences. Comput Applic Biosci. 1991;7:347–352. [DOI] [PubMed] [Google Scholar]
  • 34. Gutell RR, Power A, Hertz GZ, Putz EJ, Stormo GD. Identifying Constraints on the Higher-Order Structure of RNA: Continued Development and Application of Comparative Sequence Analysis Methods. Nucl Acids Res. 1992;20:5785–5795. 10.1093/nar/20.21.5785 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: The Protein Families Database. Nucl Acids Res. 2014;42:D222–D230. 10.1093/nar/gkt1223 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, et al. Rfam 13.0: Shifting to a Genome-Centric Resource for Non-Coding RNA Families. Nucl Acids Res. 2018;46:D335–D342. 10.1093/nar/gkx1038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Besag J. Efficiency of Pseudolikelihood Estimation for Simple Gaussian Fields. Biometrika. 1977;64:616–618. 10.1093/biomet/64.3.616 [DOI] [Google Scholar]
  • 38.Eddy SR. Multiple Alignment Using Hidden Markov Models. In: Rawlings C, Clark D, Altman R, Hunter L, Lengauer T, Wodak S, editors. Proc. Third Int. Conf. Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press; 1995. p. 114–120. [PubMed]
  • 39. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information Content of Binding Sites on Nucleotide Sequences. J Mol Biol. 1986;188:415–431. 10.1016/0022-2836(86)90165-8 [DOI] [PubMed] [Google Scholar]
  • 40. Eddy SR. A Probabilistic Model of Local Sequence Alignment that Simplifies Statistical Significance Estimation PLOS Comput Biol. 2008;4:e1000069 10.1371/journal.pcbi.1000069 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Rivas E, Clements J, Eddy SR. A Statistical Test for Conserved RNA Structure Shows Lack of Evidence for Structure in lncRNAs. Nature Methods. 2017;14:45–48. 10.1038/nmeth.4066 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Rivas E. RNA Structure Prediction Using Positive and Negative Evolutionary Information. biorXiv 933952v2 [Preprint]. 2020 [Cited 11 June 2020]. Available from: https://www.biorxiv.org/content/10.1101/2020.02.04.933952v2 [DOI] [PMC free article] [PubMed]
  • 43. Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S. Compilation of tRNA Sequences and Sequences of tRNA Genes. Nucl Acids Res. 1998;26:148–153. 10.1093/nar/26.1.148 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Roth A, Weinberg Z, Chen AG, Kim PB, Ames TD, Breaker RR. A Widespread Self-Cleaving Ribozyme Class is Revealed by Bioinformatics. Nat Chem Biol. 2014;10:56–60. 10.1038/nchembio.1386 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Nawrocki EP, Eddy SR. Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLOS Comput Biol. 2007;3:e56 10.1371/journal.pcbi.0030056 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Montange R, Batey RT. Structure of the S-adenosylmethionine Riboswitch Regulatory mRNA Element. Nature. 2006;441:1172–1175. 10.1038/nature04819 [DOI] [PubMed] [Google Scholar]
  • 47. Eddy SR. Accelerated profile HMM searches. PLOS Comp Biol. 2011;7:e1002195 10.1371/journal.pcbi.1002195 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Westhof E, Sundaralingam M. Restrained Refinement of the Monoclinic Form of Yeast Phenylalanine Transfer RNA. Temperature Factors and Dynamics, Coordinated Waters, and Base-Pair Propeller Twist Angles. Biochemistry. 1986;25:4868–4878. 10.1021/bi00365a022 [DOI] [PubMed] [Google Scholar]
  • 49. Crothers DM, Seno T, Soll DG. Is There a Discriminator Site in tRNA? Proc Natl Acad Sci USA. 1972;69:3063–3067. 10.1073/pnas.69.10.3063 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Barton JP, De Leonardis E, Coucke A, Cocco S. ACE: Adaptive Cluster Expansion for Maximum Entropy Graphical Model Inference. Bioinformatics. 2016;32:3089–3097. 10.1093/bioinformatics/btw328 [DOI] [PubMed] [Google Scholar]
  • 51. Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse Statistical Physics of Protein Sequences: A Key Issues Review. Reports on Progress in Physics. 2018;81(3):032601 10.1088/1361-6633/aa9965 [DOI] [PubMed] [Google Scholar]
  • 52. Kinjo AR. A Unified Statistical Model of Protein Multiple Sequence Alignment Integrating Direct Coupling and Insertions. Biophysics and Physicobiology. 2016;13:45–62. 10.2142/biophysico.13.0_45 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Haldane A, Levy RM. Influence of Multiple-Sequence-Alignment Depth on Potts Statistical Models of Protein Covariation. Physical Review E. 2019;99(3-1):032405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Muntoni AP, Pagnani A, Weigt M, Zamponi F. Aligning Biological Sequences by Exploiting Residue Conservation and Coevolution. biorXiv 101295v1 [Preprint]. 2020 [Cited 15 June 2020]. Available from: https://www.biorxiv.org/content/10.1101/2020.05.18.101295v1 [DOI] [PubMed]
  • 55. Henikoff S, Henikoff JG. Protein Family Classification Based on Searching a Database of Blocks. Genomics. 1994;19:97–107. 10.1006/geno.1994.1018 [DOI] [PubMed] [Google Scholar]
  • 56. Griffiths-Jones S. RALEE–RNA ALignment Editor in Emacs. Bioinformatics. 2005;21:257–259. 10.1093/bioinformatics/bth489 [DOI] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008085.r001

Decision Letter 0

Jason A Papin, Sushmita Roy

18 Aug 2020

Dear Dr. Eddy,

Thank you very much for submitting your manuscript "Remote homology search with hidden Potts models" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a revised version that takes into account the reviewers' comments.

In particular, please consider Reviewer 1's suggestion for improving clarity of the paper, Reviewer 3's suggestion for providing results with simulations and provide additional user documentation to make the code easily usable and results reproducible.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Sushmita Roy, Ph.D.

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Recent work has shown that deep multiple-sequence alignments can be analyzed to find correlations between columns that indicate interactions between residues (nucleotides or amino acids). Much of this work has used "Potts models". The paper under review considers the use of Potts models to perform homology search, i.e., to find sequences that are likely to be homologous to sequences in an existing multiple-sequence alignment. The paper concludes (1) that the extra information in the tertiary interactions that Potts models consider shows promise for improving homology searches, but (2) the approximation in the Potts model leads to inferior discrimination compared to a model analogous to "Covariance Models", which are currently the best general method for RNA homology search. Therefore, it's probably necessary (in future work) to formulate a probabilistic model more like a covariance model, rather than a Potts model. Additionally, the paper devises a method, based on importance sampling, to estimate the likelihood that a given sequence is generated by a given Potts model, a problem whose exact solution is almost certainly intractable.

Homology search is a fundamental and important task. Improvements in accuracy would positively impact many research questions. Moreover, the paper's basic idea (that tertiary interactions could improve accuracy) is very reasonable. I found the paper generally well executed, and its conclusions convince me. There are two aspects that could be regarded as weaknesses, but I think are okay. First, the experiments use only 3 different RNAs, making them arguably somewhat anecdotal. However, I can't credibly argue that the overall conclusions (e.g. in the first paragraph of my review) are likely to change with more RNAs. Therefore, I can't ask the authors to produce more data -- the paper's conclusions are already convincing (at least, to me). The second potential weakness is that the paper is arguably a kind of negative result, in that a main conclusion is that HPM don't work that well, and some other (related) model is likely needed in order to usefully exploit information in interactions. However, I think the basic approach is a reasonable one that was deserving of investigation, and the paper gives important information on what kind of approaches are likely to work better.

My remaining comments are quite minor, mostly directed at improving the descriptions in a few details.

MINOR COMMENTS

W.r.t. this sentence: "A Potts model expresses the probability of a homologous sequence as a function of primary conservation and all possible pairwise correlations between all consensus sites in a biological sequence (i.e., consensus columns in a multiple sequence alignment)." : for readers not in the field, it might be good to define the concept of consensus columns explicitly. It's kind of hinted at, but never defined.

The journal for ref 50 is missing. I see that this is a preprint, but there needs to be more information on where it is.

I found the notation P_m(x_m) a bit strange (e.g., in equation 2). I think this is the probability of x_m according to the model 'm'. Wouldn't it be more conventional to write this as P(x_m|M), and state that M is the model? Also, I don't get why the model is broken into sub-models (as I understand it), e.g., p_t(\\rho).

In equations (5) and (6), lower-case 'p' occurs in the right-hand side of the equation. I assume these should be a capital 'P', e.g. the probability of x_i_j in equation (5)? Otherwise, I'm not clear on what they represent.

In equation (6), the symbol \\rho_n represents the states of the HPM. This could be explicitly stated. Secondly, I think it might help if the states were represented by s_n. In this equation the p and \\rho look similar. Also, I got to this point in the paper believing that the path variable was \\overrightarrow{p} and not \\overrightarrow{\\rho}. I think s would be more conventional.

I'm not sure I understand the description of \\Lambda in equation (6). Does \\rho include deletion characters or not? I think it does (since deletion characters must correspond to states), but the phrasing makes me wonder. Also, why is the variable L introduced here, when it's only actually used a few pages later? I found the latter clause in the sentence (page 7, line 165) more confusing than helpful. If it's left in, I'd recommend changing the semicolon to a new sentence (it looked like a comma when I first read it), and something like "Here, \\Lambda is the total number of states in \\rho. Given that \\rho includes deletion characters, \\Lambda is at least as large as the number of non-gap characters in x." But really, I think this should be moved later in the text.

It would be nice to remind the reader of the meaning of M on page 10? Also, I think the variable L should be defined here. I believe this is the same as the L on page 7.

Since Potts models don't have a notion of insertions, the Potts model is presumably using every column in the input alignment (even gappy ones), right? In this case, is HMMER told to also treat all columns (even gappy ones) as consensus columns? I think this would be good to clarify.

I have some questions about this sentence: "As Z is a constant for a particular model, scores can be compared relatively within a single database search with a given query HPM, but not qualitatively across different query HPMs (with different unknown Z's).". I get why searches with different HPM models (which have different implicit values of Z) can't be compared. But, this sentence implies that searches with the same query HPM of different databases are not comparable (because it specifies "within a single database"). Is this simply because the sizes of the databases are (in general) different, and so the statistical significance of a given score changes? Or is there some other reason?

I agree with the following statement, but I think it needs more support. Not necessarily objective data, but something like a hypothetical scenario, or some kind of an argument illustrating the intuition. "However, when trying to improve the sensitivity of homology search, even small increases in signal are potentially useful."

In Fig. 4B, the black line is an HPM with e_ij set to zero. Isn't this equivalent to a pHMM? Why does it perform so much better than HMMER in this test?

I suspect a bit more analysis of Fig. 4 would be helpful for some readers, and to some extent for me. Mainly, it's apparent that Infernal (Fig 4D) allows for a significant probability for G-U base pairs in comparison to the training MSA (Fig 4B). This is presumably the result of priors that lead Infernal to anticipate the Watson-Crick-compatible G-U pair. Thus, Infernal is distorting the input distribution, but in a desired way. However, I myself am a bit puzzled as to why it doesn't also allow for U-G base pairs. At any rate, the G-A and A-C pairs that the HPM allows (Fig 4C) are not desired. I think a brief explanation of these issues in Fig. 4 would help.

Minor grammar issues / typos:

"the the training alignment"

"with performance is closer to Infernal's"

"Besides improved parameterization, there are other issues to address to make HPMs as the basis for useful homology search and alignment software tools." : delete "as"

"grand number n/a" --> "grant number N/A"

SUGGESTION OUT OF THE SCOPE OF PAPER

If HPMs prove successful, couldn't they be added as a new final step in Infernal's search pipeline? Perhaps the time that HPM alignments take would not longer be such a significant problem in this context. I think this is out of the scope of the paper, but the paper could mention this possible if the authors wish. (If not, I'm fine with that; no need to rebut.)

Reviewer #2: Reproducibility report has been uploaded as an attachment.

Reviewer #3: The submission introduces "hidden Potts models", a novel approach to identify remote homology which extends the profile-HMM approach to Potts models used to model co-evolving sites for structural prediction.

The problem tackled is central to computational biology. The new method is imaginative and thought-provoking. The manuscript is very clearly written.

I am generally very positive about the manuscript. I believe a better characterisation of the approach under the model through simulation and better code usability would facilitate follow-up work on this promising new line of research.

Major points

-----------------

* Given the novelty of the approach and implementation, I would expect some validation using simulated data under the model, to convince that the approach works as expected. For instance, I would think that the following experiments would be informative. e.g.

- parameter estimation: generate sequences under models with specific parameters, and demonstrate that the parameters can be estimated correctly from the data.

- alignment accuracy: can the method recover the correct alignment (i.e. residue-level homology) among a bunch of sequences generated by a model.

- convergence of monte carlo sampling: how long is sampling needed to obtain satisfactory results in the above experiments?

Minor points

-----------------

* "We use synthetic sequences in order to avoid penalizing a method that identifies remote, previously unknown evolutionary relationships. Decoys are created with characters drawn i.i.d from the nucleotide composition of the positive test sequences, with the length of each decoy matching a randomly-selected positive test sequence." One risk with this approach is that the generated sequences have properties that are pretty different from real sequences (i.i.d.). I feel this should be at least discussed. But two simple strategies which could be of interest would be to invert the sequences (which keeps local composition) or shuffling the sequence in local windows.

* I would expect the "no e_{kl} HPM" model to performs similarly or worse than pHMM. Yet it gave better performance on the tRNA model. Any sense why this could be the case?

* "In the three alignment benchmarks (see table 2), the all-by-all HPM is more accurate than the no e_{kl} HPM. We conclude e_{kl}’s generally add sensitivity to remote homology search and alignment." Were parameters also fitted with the constraint that e_{kl} = 0?

* Potts model fitting need very large MSAs. This is likely to be the case here. It would be interesting to see if the number of sequences used for training has an impact on the performance. But in any case this could be discussed.

* Code is provided, but this is quite some way to being "usable". There is hardly any documentation, and no test. The code has some external dependencies which makes compiling non-trivial. No binary is provided. I understand the authors regard their contribution more of a proof of concept than a tool (and indeed the senior author has an outstanding track record providing usable tools) but a little effort toward making the code more usable would surely encourage reuse and extensions. (Actually, I later found that the code and readme files provided as supplementary materials provides sample files and compiling instructions—please include these on the GitHub repo as well please, as most readers are likely to retrieve the code from Github rather than a supplementary file to the manuscript).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Anand K. Rampadarath

Reviewer #3: Yes: Christophe Dessimoz

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01035.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008085.r003

Decision Letter 1

Jason A Papin, Sushmita Roy

27 Oct 2020

Dear Dr. Eddy,

We are pleased to inform you that your manuscript 'Remote homology search with hidden Potts models' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Sushmita Roy, Ph.D.

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all of my comments. I think the paper can be published in its present form.

Reviewer #2: Reproducibility report has been uploaded as an attachment.

Reviewer #3: I thank the authors for addressing my points thoroughly, and in particular for performing the simulation-based analyses I requested. I have no further reservation.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Anand K. Rampadarath

Reviewer #3: Yes: Christophe Dessimoz

Attachment

Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01035R1 (1).pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008085.r004

Acceptance letter

Jason A Papin, Sushmita Roy

25 Nov 2020

PCOMPBIOL-D-20-01035R1

Remote homology search with hidden Potts models

Dear Dr Eddy,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Nicola Davies

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Derivation of importance sampling alignment algorithm (PDF).

    (PDF)

    S1 Code. Contains the HPM software implementation, versions of HMMER and Infernal used in benchmarking, the Easel sequence library, and benchmark datasets (ZIP).

    (GZ)

    Attachment

    Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01035.pdf

    Attachment

    Submitted filename: reviewer_response_WilburnEddy20.pdf

    Attachment

    Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01035R1 (1).pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files. Additionally, all relevant data will be posted publicly without limitations on http://eddylab.org/publications.html.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES