PLOS Computational Biology
PLoS Comput Biol. 2022 Mar 7;18(3):e1009492. doi: 10.1371/journal.pcbi.1009492

Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Samantha Petti 1, Sean R Eddy 2,*
Editor: Maricel G Kann
PMCID: PMC8929697  PMID: 35255082

Abstract

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which no test sequence is more than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.

Author summary

Typically, machine learning and statistical inference models are trained on a “training” dataset and evaluated on a separate “test” set. This ensures that the reported performance accurately reflects how well the method would do on previously unseen data. Biological sequences (such as protein or RNA) within a particular family are related by evolution and therefore may be very similar to each other. In this case, the standard approach of randomly splitting the data into training and test sets could yield test sequences that are nearly identical to some sequence in the training set, and the resulting benchmark may overstate the model’s performance. This motivates the design of strategies for dividing sequence families into dissimilar training and test sets. To this end, we used ideas from graph algorithms in computer science to design two new methods for splitting sequence data into dissimilar training and test sets. These algorithms can successfully produce dissimilar training and test sets for more protein families than a previous approach, allowing us to include more families in benchmark datasets for biological sequence analysis tasks.


This is a PLOS Computational Biology Methods paper.

Introduction

Computational methods are typically benchmarked on test data that are separate from the data used to train the method [1–4]. In many areas of machine learning and statistical inference, data samples can be thought of as approximately independent samples from some unknown distribution describing the data. In this case a standard approach is to randomly split the available data into a training and a test set, fit a model to the training set, and evaluate the model on the test set. In computational biology, families of biological sequences are not independent because they are related by evolution. Random splitting typically results in test sequences that are closely related or even identical to training sequences, which leads to artifactual overestimation of performance. The problem becomes more concerning for complex models capable of memorizing their training inputs [5]. This issue motivates strategies that consider sequence similarity and split data into dissimilar training and test sets [1–4].

We are specifically interested in benchmarking methods for remote sequence homology detection. We want to test how well a homology detection method, given a homologous clade of sequences as input, can detect other homologous sequences in a distant outlying clade. The remote homologs y are not drawn from the same distribution as the known sequences x; they come from a different distribution P(y | x, t), where t accounts for the evolutionary distances separating a remote homolog y from the known examples x on a phylogenetic tree. We can create artificial cases of this by splitting known sequence families phylogenetically at deep ancestral nodes. The difficulty of detecting remote homologs depends more on the distance to the outlying sequences than on details of the tree topology. Therefore, inferring a complete tree topology is unnecessary; it is sufficient and more relevant to have a clear distance-based rule for establishing training and test set splits that are challengingly dissimilar.

Previous work from our group splits a given sequence family into training and test sets using single-linkage clustering by pairwise sequence identity at a chosen threshold p, such as p = 25% for protein or p = 60% for RNA [6, 7]. One cluster becomes the training set, and the remaining clusters are the source of test sequences. This procedure is a fast proxy for building a phylogenetic tree from distances based on percent identity and selecting test sequences from the outlying clades. We refer to this procedure as the Cluster algorithm in this paper. The procedure guarantees that no sequence in the test set has more than p% pairwise identity to any sequence in the training set. This is a clear and simple rule for ensuring that training and test sequences are only remotely homologous, and we can control p to vary the difficulty of the benchmark.

We have found that in many cases, the Cluster algorithm is unable to split a family because single-linkage clustering collapses it into a single cluster, even though a valid split could have been identified if we had removed certain sequences before clustering. For example, if a family contains two groups that would form separate single-linkage clusters at 25% identity and even just one bridging sequence that is >25% identical to a sequence in each group, then single-linkage clustering collapses all the sequences into one cluster. If we omit the bridge sequence, the two groups form separate clusters after single-linkage clustering. The larger the family, the more likely it is to contain sequences that bridge together otherwise dissimilar clusters, so the procedure fails more often on alignments with many sequences. This is a concern because we and others are exploring increasingly complex and parameter-rich models for remote sequence homology recognition that can require thousands of sequences for training [8–13]. A phylogenetic approach that attempts to identify an out-group would face this same “bridge” issue. In order to produce training/test set splits for benchmarks that cover a more diverse range of sequence families represented by alignments with many sequences, we were interested in improving on Cluster.

Here we describe two improved splitting algorithms, called Blue and Cobalt, that are derived from “independent set” algorithms in graph theory. A main intuition is that Blue and Cobalt can exclude some sequences as they identify dissimilar clusters. Blue splits more families, but can be computationally prohibitive on alignments with many sequences (over 50,000). Cobalt (a shade of Blue) is much more computationally efficient and is still a large improvement over Cluster. We compare these algorithms to Cluster and to a simple algorithm that selects a training set independently at random, which we call Independent Selection. We compare splitting success and computational time on a large set of different MSAs with tens to hundreds of thousands of sequences. In addition, we compare homology search benchmarks built with these different splitting algorithms.

Results

Given a set of sequences (here, a multiple sequence alignment), the goal is to split it into a training set and a test set, such that no test sequence has > p% pairwise identity to any training sequence and no pair of test sequences is > q% identical. The first criterion defines dissimilar training and test sets, and the second criterion reduces redundancy in the test set. (We preserve the alignment of the training set, including sequence redundancy; the goal of the benchmark is to have realistic query sequence alignments, which do often include redundant sequences. Different homology search methods deal with sequence redundancy in different ways. Most profile construction methods use relative sequence weights, downweighting similar sequences.) The choice of thresholds p and q should be decided based on the goals of the methods being benchmarked.

In order to guarantee that no test sequence has > p% pairwise identity to any training sequence, some sequences will end up in neither the training nor the test set. Our algorithms find candidate training and test sets, which are returned only if each is larger than a user-specified minimum acceptable size.

We cast the splitting problem in terms of graph theory, with each sequence represented by a vertex and similarity indicated by an edge. For example, a pairwise identity of > p% between two sequences defines an edge for the first criterion. Each splitting method is a two-step procedure, and we use related algorithms for the two steps. In step (i), we identify disjoint subsets S and T of our original set of sequences such that for any x ∈ S and y ∈ T there is no edge (pairwise identity > p%) between x and y. We assign S as the training set and T as the candidate test set. Step (ii) then starts with a graph on T, using the pairwise identity threshold q to define edges. We identify a representative subset U ⊆ T such that no pair of vertices y, y′ ∈ U is connected by an edge, and assign U to be the test set. The graph problems in steps (i) and (ii) are related; it is useful to discuss the simpler algorithm for step (ii) before describing its adaptation to step (i).
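To make the edge definition concrete, here is a minimal Python sketch of a pairwise identity calculation and the resulting edge predicate. This is illustrative only: the published implementation is in C (within the HMMER/Easel code base), and the exact identity definition used there (for example, the choice of denominator) may differ.

def percent_identity(a, b):
    # Identity between two aligned sequences (rows of the MSA), counted over
    # columns where neither sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def has_edge(a, b, threshold):
    # Two sequences (vertices) are joined by an edge if their pairwise
    # identity exceeds the threshold (p for step (i), q for step (ii)).
    return percent_identity(a, b) > threshold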

Step (ii) is exactly the well-studied graph problem of finding an independent set. Formally, in a graph G = (V, E) with vertex set V and edge set E, a subset of vertices U ⊆ V is an independent set (IS) if for all u, w ∈ U, (u, w) ∉ E. To frame step (i), we define a bipartite independent pair (BIP) as a pair of disjoint sets U1, U2 such that there are no edges between pairs of vertices in U1 and U2, i.e., for all u1 ∈ U1 and u2 ∈ U2, (u1, u2) ∉ E. The algorithms we describe here follow this two-step approach, but differ in how they achieve each step.

While it is NP-hard to find a maximum size independent set in a graph [14], randomized algorithms can be applied to quickly find a maximal independent set (an independent set where, if any additional vertex were added, the set would no longer be an independent set). The Blue and Cobalt methods are inspired by two such algorithms [15, 16]. Unlike Cluster, Blue and Cobalt always find a maximal independent set in step (ii).

Splitting algorithms

In our descriptions below, vertex w is a neighbor of vertex v if (v, w) is an edge in the graph. The degree of a vertex v, denoted d(v), is the number of neighbors of v. The neighborhood of v in the graph G = (V, E) is N(v) = {w ∈ V : (w, v) ∈ E}.

Cobalt

The Cobalt algorithm is an adaptation of the greedy sequential maximal independent set algorithm studied in [15]: the graph’s vertices are ordered arbitrarily, and each vertex is added to the independent set if none of its neighbors have already been added. Step 2 of Cobalt is this algorithm with the vertex order given by a random permutation. Assigning a vertex to an IS disqualifies all of its neighbors from the IS, so it may be advantageous to avoid placing high-degree vertices in the IS. In Cobalt, higher-degree vertices are less likely to be added to the IS; a vertex v is certain to be placed in the IS if all of its neighbors come after it in the random order, which happens with probability 1/(d(v) + 1). The bias towards including low-degree vertices, which correspond to “outlier” sequences, is desirable for creating benchmarks that include the most remote homologs.

Algorithm 1: Greedy sequential IS in graph G = (V, E) (Cobalt Step 2)

Result: An independent set U in G = (V, E)

U = ∅

Place the vertices of V in a random order: v1, v2, … vn.

for i=1 to n do

if vi is not adjacent to any vertex in U then U = U ∪ {vi};

end

return U
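A minimal Python sketch of Algorithm 1, assuming a two-argument has_edge(u, v) predicate such as the one sketched earlier (with the threshold already bound). This illustrates the greedy sequential idea rather than reproducing the authors' C implementation.

import random

def greedy_sequential_is(vertices, has_edge):
    # Visit the vertices in a random order; keep each vertex whose neighbors
    # have not already been kept. The result is a maximal independent set.
    order = list(vertices)
    random.shuffle(order)
    independent = []
    for v in order:
        if all(not has_edge(v, u) for u in independent):
            independent.append(v)
    return independent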

Step 1 is a variant which instead finds a bipartite independent pair. Once a BIP is found in Step 1, the larger set is declared the training set, and the smaller set is input into the greedy sequential IS algorithm as the vertex set of G2 (Cobalt Step 2). We assign the larger set as the training set because the goal is to benchmark on realistic input alignments, and the larger cluster is more like the original input alignment; additionally, we aim to benchmark methods that may require large numbers of training sequences.

Algorithm 2: Greedy sequential BIP in graph G = (V, E) (Cobalt Step 1)

Result: A bipartite independent pair S, T in G = (V, E)

S, T = ∅

Place the vertices of V in a random order: v1, v2, … vn.

for i=1 to n do

 Sample r ∼ unif(0, 1).

if r < 1/2 then

  if vi is not adjacent to any vertex in T then S = S ∪ {vi};

  else if vi is not adjacent to any vertex in S then T = T ∪ {vi};

else

  if vi is not adjacent to any vertex in S then T = T ∪ {vi};

  else if vi is not adjacent to any vertex in T then S = S ∪ {vi};

end

end

if |S| < |T| then swap the names of S and T;

return S, T
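A corresponding sketch of Algorithm 2 (Cobalt Step 1), under the same assumptions. A vertex may join a set only if it has no neighbor in the other set, which preserves the bipartite independent pair property; the coin flip decides which set it tries first.

import random

def greedy_sequential_bip(vertices, has_edge):
    order = list(vertices)
    random.shuffle(order)
    S, T = [], []
    for v in order:
        # A fair coin decides whether v tries to join S or T first.
        first, second = (S, T) if random.random() < 0.5 else (T, S)
        if all(not has_edge(v, u) for u in second):
            first.append(v)        # no neighbor in the opposite set
        elif all(not has_edge(v, u) for u in first):
            second.append(v)
    if len(S) < len(T):
        S, T = T, S                # the larger set becomes the training set
    return S, T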

Blue

The Blue algorithm leverages the fact that the number of vertices disqualified by the addition of a vertex v to an IS is not exactly its degree; it is the number of neighbors of v that are still eligible. Blue is based on the IS Random Priority Algorithm introduced by [16]. In each round of this algorithm, the probability of selecting a vertex is inversely proportional to the number of neighbors that are eligible at the beginning of the round.

Each eligible vertex is labeled with a value drawn uniformly at random from the interval [0, 1]. If a vertex has a lower label than all of its eligible neighbors, the vertex is added to the independent set and its neighbors are declared ineligible. This process repeats until there are no eligible vertices. The pseudocode presented here describes the multi-round election process in the most intuitive way; our implementation avoids storing the entire graph structure G and instead computes the similarity relationship only when the algorithm needs to know whether an edge exists.

Algorithm 3: Random Priority IS in graph G = (V, E) (Blue Step 2)

Result: An independent set U in G = (V, E)

U = ∅; L = V

while L ≠ ∅ do

 Declare an empty dictionary ℓ.

for each v ∈ L do ℓ(v) ∼ unif(0, 1);

 Place the vertices of L in a random order: v1, v2, … vk

for i=1 to k do

  if vi ∈ L and ℓ(vi) < ℓ(w) for all w ∈ L ∩ N(vi) then

   U = U ∪ {vi}

   L = L \ (N(vi) ∪ {vi})

  end

end

end

return U
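A Python sketch of Algorithm 3 under the same assumptions; as in the pseudocode, fresh labels are drawn each round and edges are evaluated only when the algorithm asks for them.

import random

def random_priority_is(vertices, has_edge):
    U = []
    eligible = set(vertices)
    while eligible:
        # Every still-eligible vertex draws a fresh label this round.
        label = {v: random.random() for v in eligible}
        order = list(eligible)
        random.shuffle(order)
        for v in order:
            if v not in eligible:
                continue
            nbrs = [w for w in eligible if w != v and has_edge(v, w)]
            if all(label[v] < label[w] for w in nbrs):
                U.append(v)                  # v beats all eligible neighbors
                eligible.discard(v)
                eligible.difference_update(nbrs)
    return U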

In our modification of this algorithm to find a BIP, we keep track of each vertex’s eligibility for each of the sets S and T. In each round, every vertex that is eligible for at least one set is declared either an S-candidate or a T-candidate and assigned a label uniformly at random from the interval [0, 1]. Each S-candidate is added to S if its label is smaller than the labels of all its neighbors that are both T-candidates and still T-eligible. When a vertex v is added to S, v is declared ineligible for both S and T, and all neighbors of v are declared ineligible for T. After iterating through all S-candidates, any T-candidates that are still T-eligible are added to T. Once a BIP is found, the larger set is declared the training set, and the smaller set is input into the Random Priority IS algorithm as the vertex set of G2 (Blue Step 2).

Algorithm 4: Random Priority BIP in graph G = (V, E) (Blue Step 1)

Result: A bipartite independent pair S, T in G = (V, E)

S, T = ∅; LS, LT = V

while LS ∪ LT ≠ ∅ do

CS, CT = ∅

for each v ∈ LS ∪ LT do

  if v ∈ LS \ LT then CS = CS ∪ {v};

  if v ∈ LT \ LS then CT = CT ∪ {v};

  if v ∈ LT ∩ LS then

   Sample r ∼ unif(0, 1).

   if r < 1/2 then CS = CS ∪ {v};

   else CT = CT ∪ {v};

  end

end

 Declare an empty dictionary ℓ.

for each v ∈ CS ∪ CT do ℓ(v) ∼ unif(0, 1);

 Place the vertices of CS in a random order: v1, v2, … vk

for i=1 to k do

  if ℓ(vi) < ℓ(w) for all w ∈ LT ∩ CT ∩ N(vi) then

   S = S ∪ {vi}, LT = LT \ (N(vi) ∪ {vi}) and LS = LS \ {vi}

  end

end

T = T ∪ (CT ∩ LT)

for v ∈ (CT ∩ LT) do LT = LT \ {v} and LS = LS \ (N(v) ∪ {v});

end

if |S| < |T| then swap the names of S and T;

return S, T
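A sketch of Algorithm 4 that follows the pseudocode step by step; LS and LT track eligibility for S and T, and CS and CT are the per-round candidate sets. As above, this is an illustration under the stated assumptions, not the authors' implementation.

import random

def random_priority_bip(vertices, has_edge):
    S, T = [], []
    LS, LT = set(vertices), set(vertices)      # S- and T-eligibility
    while LS | LT:
        CS, CT = set(), set()
        for v in LS | LT:
            if v in LS and v not in LT:
                CS.add(v)
            elif v in LT and v not in LS:
                CT.add(v)
            else:                              # eligible for both: coin flip
                (CS if random.random() < 0.5 else CT).add(v)
        label = {v: random.random() for v in CS | CT}
        order = list(CS)
        random.shuffle(order)
        for v in order:
            rivals = [w for w in LT & CT if has_edge(v, w)]
            if all(label[v] < label[w] for w in rivals):
                S.append(v)
                LT.difference_update([w for w in LT if w != v and has_edge(v, w)] + [v])
                LS.discard(v)
        for v in CT & LT:                      # surviving T-candidates join T
            T.append(v)
            LT.discard(v)
            LS.difference_update([w for w in LS if w != v and has_edge(v, w)] + [v])
    if len(S) < len(T):
        S, T = T, S                            # larger set is the training set
    return S, T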

Repetitions of Blue and Cobalt

The use of randomness is a strength of Cobalt and Blue. Unlike Cluster, which produces the same training set and the same test set size every time the algorithm is run, the sets produced by Blue and Cobalt may be highly influenced by which vertices are selected first. Running the algorithms many times typically yields different results. We implemented two features to take advantage of this: (i) the “run-until-n” option, in which the algorithm runs at most n times and returns the first split that satisfies a user-defined threshold, and (ii) the “best-of-n” option, in which the algorithm runs n times and returns the split that maximizes the product of the training and test set sizes (equivalently, their geometric mean).
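A hypothetical wrapper illustrating the “best-of-n” idea; split_fn stands for one run of Blue or Cobalt on a family, and the size thresholds are placeholders for the user-specified minimums (this is not the actual command-line interface).

def best_of_n(split_fn, n=40, min_train=10, min_test=2):
    # Run a randomized splitter n times and keep the split that maximizes the
    # product of training and test set sizes; return None if no run qualifies.
    best = None
    for _ in range(n):
        train, test = split_fn()
        if len(train) < min_train or len(test) < min_test:
            continue
        if best is None or len(train) * len(test) > len(best[0]) * len(best[1]):
            best = (train, test)
    return best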

Cluster

In the first step, the graph G1 is partitioned into connected components; by definition there is no edge between any pair of connected components. The vertices of the largest connected component are returned as the training set S. The remaining vertices become the set T, and the test set U is formed by selecting one vertex at random from each connected component of the graph G2 with vertex set T.
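For comparison, a sketch of the Cluster procedure in the same style: single-linkage clusters at threshold p are exactly the connected components of the p-identity graph, and the test set takes one random representative per connected component of the q-identity graph on the leftover sequences (illustrative only; the published implementation differs in detail).

import random
from collections import deque

def connected_components(vertices, has_edge):
    # Breadth-first search over lazily evaluated edges.
    unvisited = set(vertices)
    components = []
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            nbrs = [w for w in unvisited if has_edge(v, w)]
            for w in nbrs:
                unvisited.discard(w)
                comp.add(w)
                queue.append(w)
        components.append(comp)
    return components

def cluster_split(vertices, has_edge_p, has_edge_q):
    comps = sorted(connected_components(vertices, has_edge_p), key=len, reverse=True)
    train = comps[0]                                   # largest single-linkage cluster
    rest = [v for comp in comps[1:] for v in comp]
    test = [random.choice(list(comp)) for comp in connected_components(rest, has_edge_q)]
    return train, test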

Independent selection

In the first step, every vertex of G1 is added to set S independently with probability 0.70. All vertices that are not in S and not adjacent to any vertex in S are added to T. In the second step, the greedy sequential IS algorithm (Cobalt Step 2) is applied to G2 (which has vertex set T) to produce the test set U.
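A sketch of Independent Selection, reusing the greedy_sequential_is function from the Cobalt sketch above; the selection probability 0.70 is the value stated in the text, and has_edge_p / has_edge_q are the p- and q-threshold edge predicates.

import random

def independent_selection(vertices, has_edge_p, has_edge_q, p_select=0.70):
    # Step 1: each vertex joins S independently; vertices with no neighbor in S
    # become test candidates T. Step 2: thin T with the greedy sequential IS.
    S = [v for v in vertices if random.random() < p_select]
    in_S = set(S)
    T = [v for v in vertices
         if v not in in_S and all(not has_edge_p(v, u) for u in S)]
    test = greedy_sequential_is(T, has_edge_q)
    return S, test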

Performance comparisons

We compared the success rates for splitting biological sequence families of different sizes by running our algorithms on multiple sequence alignments from the protein database Pfam [17]. To study a wide range of different numbers of sequences per family, we split both the smaller curated Pfam “seed” alignments and the larger automated “full” alignments.

Fig 1 illustrates the pass rates of the algorithms when p = 25% and q = 50%. By setting p = 25%, we are testing how well each homology search method can identify previously unseen distant homologs that are at most 25% identical to any training sequence. Of the 12340 Pfam seed families with at least 12 sequences, Blue splits 34.4%, Cobalt splits 29.0%, Cluster splits 19.1%, and Independent Selection splits 6.8% into a training-test set pair with at least 10 training and 2 test sequences. After running Blue and Cobalt 40 times each, 59.8% and 55.9% of the families (respectively) are successfully split. For the Pfam full families, we require that the training and test sets have size at least 400 and 20, respectively. Of the 9827 Pfam full families with at least 420 sequences, Blue splits 30.5%, Cobalt 28.4%, Cluster 14.0%, and Independent Selection 3.0%. Blue, Cluster, and Cobalt did not finish within 24 hours on 188, 2, and 1 families, respectively; these runs were counted as unsuccessful. The success rates of Blue and Cobalt increase to 53.6% and 50.1% after 40 iterations.

Fig 1. Performance of splitting algorithms on Pfam families.

Fig 1

(A) Fraction of the 12340 Pfam seed families with at least 12 sequences that were split into a training set of size at least 10 and test set of size at least 2. The numbers on the Blue and Cobalt bars indicate the fraction of families successfully split at least once out of 1, 5, 10, 20, 40 independent runs. (B) Fraction of the 9827 Pfam families with at least 420 sequences in their full alignment that were split into a training set of size at least 400 and test set of size at least 20.

Fig 2 illustrates the characteristics of the full families that are successfully split by the algorithms at the 400/20 threshold. S1 Fig is the analogous plot for the seed families at the 10/2 threshold. The algorithms struggle to split smaller families and families in which a high fraction of the sequence pairs are at least 25 percent identical. S2 and S3 Figs illustrate the sizes of the training and test sets produced by the four algorithms.

Fig 2. Characteristics of Pfam full families successfully split.

Fig 2

Each marker represents a family in Pfam. The connectivity of a sequence is the fraction of other sequences in the full family with at least 25% pairwise identity. Families successfully split into a training set of size at least 400 and a test set of size at least 20 are marked by a cyan circle, whereas families that were not split are marked by a red diamond. In (B) and (D) the cyan circle represents at least one successful split among 40 independent runs. The 34 families that Blue did not finish splitting within 6 days are not included in the Blue plots.

We also compare the running times of our implementations of each algorithm. Table 1 displays the runtime of the algorithms on the multi-MSAs for the Pfam seed and full databases. All algorithms can split the entire Pfam seed database in under four minutes. Most Pfam full families can be split in under one minute. Fig 3 illustrates the runtimes as a function of the product of the number of sequences and the number of columns in the alignment. Our implementations take as input a set of N sequences and only compute the distance between a pair of sequences if the algorithm needs to know whether there is an edge between the corresponding vertices. In the worst case (a family with no edges), our algorithm must compute O(N2) distances. Computing percent identity is O(L) where L is the length of the sequence. Therefore when distance is percent identity, the worst case runtime is O(LN2).

Table 1. Runtime of implementations on Pfam seed and full.

The runtime benchmarks were obtained by running each algorithm on the seed and full multi-MSAs Pfam-A.seed and Pfam-A.full on 2 cores with 8 GB RAM for the seed alignments and on 3 cores with 12 GB RAM for the full alignments. We did not compute the maximum runtime of the Blue algorithm; the algorithm failed to terminate within 6 days for 34 families.

Algorithm | All seed (min:sec) | All full (days-hours:min) | Max full (hours:min) | Full families >1 min
Blue | 3:16 | — | — | 1422 (7.9%)
Cobalt | 0:43 | 7–0:24 | 46:25 | 419 (2.3%)
Cluster | 0:58 | 5–0:31 | 37:17 | 244 (1.3%)
Indep. Selection | 0:19 | 0–5:49 | 1:30 | 48 (0.2%)

Fig 3. Runtime of algorithms.

Fig 3

Each algorithm was run once on each Pfam seed and full alignment for at most 6 days. The runtimes are reported as a function of the product of the number of sequences and the number of columns in the alignment, as box plots with outliers shown as translucent grey circles. The boxes extend from the first to third quartile, and the median is marked by a horizontal line. The results for families with at most 10,000 sequences were obtained on 2 cores and 8 GB of RAM, and the remaining were obtained on 3 cores and 12 GB of RAM. The results do not include 34 families that Blue did not finish running within 6 days. Blue finished 939 of 944 families in the [10^6, 10^7) range, 58 of 85 families in the [10^7, 10^8) range, and 1 of 3 families in the [10^8, 10^9) range (we omitted a box plot for Blue for [10^8, 10^9)).

Benchmarking homology search methods with various splitting algorithms

All four algorithms produce splits that satisfy the same dissimilarity criteria (p = 25% and q = 50%), but we noticed that the different procedures create training-test set pairs that are more or less challenging benchmarks. To study this, we used the four algorithms in a previously published benchmark procedure [7] that evaluates a method’s ability to detect whether a sequence contains a subsequence homologous to a Pfam family. Briefly, negative decoy sequences are synthetic sequences generated from shuffled subsequences randomly selected from UniProt, and positive sequences are constructed by embedding a single test domain sequence into a synthetic sequence.

We applied each algorithm to the Pfam seed families with the requirement that there be at least 10 training and 2 test sequences. To avoid over-representing families that yielded large test sets, all test sets were down-sampled to contain at most 10 sequences. First we used these splits to benchmark profile searches with the HMMER hmmsearch program [18]. As illustrated by Fig 4, ROC curves vary substantially based on the splitting algorithm used. The reported accuracy for hmmsearch is highest for the benchmark produced by Independent Selection, followed by the benchmarks produced by Cobalt, Blue, and then Cluster.

Fig 4. Benchmarks of HMMSEARCH.

Fig 4

(A) Each benchmark includes data from all families that were split into training and test sets of size at least 10 and 2, respectively, by one run of the algorithm. The number of families included in the benchmark for each algorithm is stated in the labels. For each family, HMMER produces a single profile from the alignment of the training sequences. We constructed 200,000 decoy sequences from shuffled subsequences chosen randomly from UniProt. At most 10 positive test sequences are constructed by embedding a single homologous domain sequence from the test set into a synthetic decoy sequence (see Methods). The x-axis represents the number of false positives per profile search and the y-axis represents the fraction of true positives detected with the corresponding E-value, over all profile searches. The error bars at each point represent a 95 percent confidence interval obtained by a Bayesian bootstrap. (B) The faded lines are copies of the curves in (A). The dark lines are the analogous curves constructed by restricting the benchmarks to the 708 families successfully split by all four algorithms. (C) The distribution of the distances between each test sequence and the closest training sequence (measured in percent identity) for families split by Blue, Cobalt, and Cluster.

We consider two hypotheses for why HMMER performance depends on the splitting method: (i) the families that are successfully split by a particular algorithm are also inherently easier or harder for homology recognition, and (ii) the splitting algorithms create training and test sets with inherently different levels of difficulty.

To explore the first hypothesis, we compiled ROC curves for the 708 families split by all four algorithms. Fig 4B shows that the ROC curves for Blue and Cobalt are brought closer to the ROC curve for Independent Selection, and so hypothesis (i) may explain some of the discrepancy between the Blue, Cobalt, and Independent Selection benchmarks. However, hypothesis (i) does not explain the discrepancy with the Cluster benchmark because the Blue and Cobalt ROC curves are even farther from the Cluster ROC curve under the family restriction.

The second hypothesis is likely a better explanation. A sequence that is less than 25% identical to all other sequences in the family is probably the hardest sequence for a homology search program to recognize. If such a sequence exists, the Cluster algorithm will always assign it to the test set, whereas Blue, Cobalt, and Independent Selection will assign it to the test set 50, 50, and 30 percent of the time, respectively. Fig 4C illustrates the distribution of distances (in percent identity) between each sequence in the test set and the closest sequence in the training set. The test sequences are on average farther from the closest training sequence under the Cluster algorithm.

Since the different algorithms lead to different performance results with one homology search program, we then asked whether the choice of splitting algorithm alters the relative performance in a comparison of different homology search programs. Fig 5 demonstrates that the relative ranking of the various homology search programs is approximately the same regardless of which splitting algorithm was used to produce the training and test sets. In addition to HMMER, we benchmarked BLASTP, PSI-BLAST, and DIAMOND. PSI-BLAST performs a BLAST search with a position-specific scoring matrix determined, in our case, from the set of training sequences [19]. DIAMOND is a variant of BLASTP that uses double indexing, a reduced alphabet, and spaced seeds to produce a faster algorithm [20]. DIAMOND is benchmarked using “family pairwise search,” in which the best E-value between the target sequence (positive test or negative decoy) and all sequences in the training set is reported [21]. DIAMOND is designed for speed, not sensitivity, and its low sensitivity is apparent. Running DIAMOND with the “sensitive” flag (denoted diamond-sen in Fig 5) improves accuracy, but it remains less accurate than PSI-BLAST, BLASTP, and HMMER. The choice of splitting algorithm does not alter the relative order of performance of the four search programs.

Fig 5. Homology search benchmarks on data produced by splitting algorithms.

Fig 5

The benchmarks are constructed as in Fig 4. Blue 40 and Cobalt 40 refer to the algorithms run with the “best-of-40” feature. BLASTP and DIAMOND are benchmarked using family pairwise search.

Discussion

We present two new algorithms, Blue and Cobalt, that are able to split more Pfam protein sequence families into training and test sets so that no training-test sequence pair is more than p = 25 percent identical and no test-test sequence pair is more than q = 50 percent identical. Our algorithms are able to split approximately three times as many Pfam families as the Cluster algorithm we have used in previous work [6, 7, 10], and more than six times as many families as a simple Independent Selection algorithm (see Fig 1). Our algorithms allow us to create larger and more diverse benchmarks across more Pfam families, and also to produce training sets with thousands of sequences for benchmarks of new parameter-rich machine learning models. The Blue algorithm maximizes the number of families included; the faster Cobalt algorithm is recommended for splitting large sequence families.

Blue and Cobalt are randomized algorithms that typically create different splits each time they are run. Although this is useful, different splits are unlikely to be independent; the variation between splits depends on the structure of the graph for the sequence family. Different splits are therefore not suited for a procedure like k-fold cross-validation in machine learning, for example.

We were initially surprised to find that for the same sequence identity thresholds, the four splitting algorithms result in benchmarks of varying difficulty for homology search programs. However, the relative ranking of different homology search methods is unaffected by the choice of splitting algorithm. Moreover, since the dissimilarity requirement p is an input, the difficulty of a benchmark is tunable.

These algorithms address a fundamental challenge in training and testing models in biological sequence analysis. Random splitting into training and test data assumes that all data points are independently and identically drawn from an unknown distribution P(x). A model of P(x) is fitted to the training data and evaluated on the held-out test data. In contrast, the remote homologs y that we are interested in identifying come from a different distribution than the known sequences x. The distribution P(y | x, t) depends on both the known sequences x and a measure t of the evolutionary distance to the homologs. In machine learning, “out of distribution” recognition typically means flagging anomalous samples, but here the out-of-distribution samples are the target of the task itself [22]. Our procedures create out-of-distribution test sets, with the dissimilarity of the training/test distributions controlled by the pairwise identity parameter p. The out-of-distribution nature of the remote homology search problem affects not only how appropriate benchmarks are constructed, but also how improved methods are designed.

Materials and methods

Details of benchmarking procedure

We used the benchmarking pipeline as described in [7], as implemented in the “profmark” directory and programs in the HMMER software distribution. Briefly: for a given input multiple sequence alignment (MSA), first remove all sequences whose length is less than 70% of the mean. Then the splitting algorithm produces a training set and a test set. The training set sequences remain aligned according to the original MSA, and the sequence order is randomly permuted. This alignment is used to build a profile in benchmarks of profile search methods such as HMMER “hmmsearch” and PSI-BLAST.

The test set is randomly down-sampled to contain at most 10 sequences. Pfam MSAs consist of individual domains, not complete protein sequences. Each test domain sequence is embedded in a synthetic nonhomologous protein sequence as follows: (i) draw a sequence length from the distribution of sequence lengths in UniProt that is at least as long as the test domain, (ii) embed the test domain at a random position, (iii) fill in the remaining two segments with nonhomologous sequence by choosing a subsequence of the desired length from UniProt and shuffling it. The resultant sequences form the positive test set for the particular family. Next, form a shared negative test set of 200,000 sequences similarly: (i) choose a positive test sequence at random (from the full group of test sequences) and record the lengths of its three segments, (ii) fill in each segment as described in step (iii) of the positive-sequence construction. The default “profmark” procedure in HMMER embeds two test domains per positive sequence (for purposes of testing multidomain protein parsing); for this work we used the option of embedding one domain per positive sequence.
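A rough sketch of the positive test sequence construction described above. Here uniprot_lengths (a list of UniProt sequence lengths) and sample_subseq(n) (which returns a length-n subsequence drawn from UniProt) are hypothetical helpers standing in for what the profmark programs actually do; the sketch only shows the embed-and-shuffle logic.

import random

def make_positive(test_domain, uniprot_lengths, sample_subseq):
    # Draw a total length at least as long as the test domain, embed the domain
    # at a random position, and fill both flanks with shuffled UniProt subsequence.
    L = random.choice([n for n in uniprot_lengths if n >= len(test_domain)])
    start = random.randint(0, L - len(test_domain))
    left = list(sample_subseq(start))
    right = list(sample_subseq(L - len(test_domain) - start))
    random.shuffle(left)
    random.shuffle(right)
    return ''.join(left) + test_domain + ''.join(right)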

Hardware, software and database versions used

All computations were run on Intel Xeon 6138 processors at 2.0 GHz. Our time benchmarks were measured in real (wall clock) time. Our tests were performed on the Pfam-A 33.1 database, released in May 2020. We used UniProt release 2/2019. Software versions used: HMMER 3.3.1, BLAST+ 2.9.0, DIAMOND 0.9.5.

Supporting information

S1 Fig. Characteristics of Pfam seed families successfully split.

Each marker represents a family in Pfam. The connectivity of a sequence is the fraction of other sequences in the seed family with at least 25% pairwise identity. Families successfully split into a training set of size at least 10 and a test set of size at least 2 are marked by a cyan circle, whereas families that were not split are marked by a red diamond. In (B) and (D) the cyan circle represents at least one successful split among 40 independent runs.

(TIF)

S2 Fig. Size of training and test sets produced by each algorithm on seed families.

The two-dimensional normalized histograms illustrate the distribution of training and test set sizes produced by the algorithms among results with at least 10 and 2 training and test sequences respectively. In each plot, the x-coordinate and y-coordinates of the green circle represent the median training and median test set sizes respectively. The white X is placed at the median training and test set sizes among the 2363 families that were successfully split by Blue, Cobalt, and Cluster.

(TIF)

S3 Fig. Size of training and test sets produced by each algorithm on full families.

The two-dimensional normalized histograms illustrate the distribution of training and test set sizes produced by the algorithms among results with at least 400 and 20 training and test sequences respectively. In each plot, the x-coordinate and y-coordinates of the green circle represent the median training and median test set sizes respectively. The white X is placed at the median training and test set sizes among the 1070 families that were successfully split by Blue, Cobalt, and Cluster.

(EPS)

Acknowledgments

The computations in this paper were run on the Cannon cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.

Data Availability

The splitting algorithms are implemented in C and available here: https://github.com/EddyRivasLab/hmmer/tree/develop. To run the algorithms, the following version of EASEL is needed: https://github.com/EddyRivasLab/easel/tree/develop. The code used to generate the figures in this paper is available at https://github.com/spetti/split_for_benchmarks.

Funding Statement

SP is funded by the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard (award number #1764269, https://quantbio.harvard.edu/mathbio). SRE is funded by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01-HG009116 (https://www.genome.gov/). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Söding J, Remmert M. Protein Sequence Comparison and Fold Recognition: Progress and Good-Practice Benchmarking. Curr Opin Struct Biol. 2011;21:404–411. doi: 10.1016/j.sbi.2011.03.005
2. Walsh I, Pollastri G, Tosatto SCE. Correct Machine Learning on Protein Sequences: A Peer-Reviewing Perspective. Brief Bioinform. 2015;17:831–840. doi: 10.1093/bib/bbv082
3. Jones DT. Setting the Standards for Machine Learning in Biology. Nat Rev Mol Cell Bio. 2019;20:659–660. doi: 10.1038/s41580-019-0176-5
4. Walsh I, Fishman D, Garcia-Gasulla D, Titma T, Pollastri G, ELIXIR Machine Learning Focus Group, et al. DOME: Recommendations for Supervised Machine Learning Validation in Biology. Nat Methods. 2021. doi: 10.1038/s41592-021-01205-4
5. Arpit D, Jastrzebski S, Ballas N, Krueger D, Bengio E, Kanwal MS, et al. A closer look at memorization in deep networks. In: Proc Int Conf Mach Learn. Proc Mach Learn Res; 2017. p. 233–242.
6. Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: Inference of RNA Alignments. Bioinformatics. 2009;25:1335–1337. doi: 10.1093/bioinformatics/btp157
7. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195
8. Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16:1315–1322. doi: 10.1038/s41592-019-0598-1
9. Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, et al. Using deep learning to annotate the protein universe. bioRxiv 626507 [Preprint]. 2019 [posted 2019 Jul 15; cited 2021 Jul 5]. Available from: https://www.biorxiv.org/content/10.1101/626507v4.full.pdf
10. Wilburn GW, Eddy SR. Remote homology search with hidden Potts models. PLoS Comput Biol. 2020;16:e1008085. doi: 10.1371/journal.pcbi.1008085
11. Muntoni AP, Pagnani A, Weigt M, Zamponi F. Aligning biological sequences by exploiting residue conservation and coevolution. Phys Rev E. 2020;102:062409. doi: 10.1103/PhysRevE.102.062409
12. Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020;117:1496–1503. doi: 10.1073/pnas.1914677117
13. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118. doi: 10.1073/pnas.2016239118
14. Karp RM. Reducibility among combinatorial problems. In: Complexity of Computer Computations. Springer; 1972. p. 85–103.
15. Blelloch GE, Fineman JT, Shun J. Greedy sequential maximal independent set and matching are parallel on average. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures; 2012. p. 308–317.
16. Métivier Y, Robson JM, Saheb-Djahromi N, Zemmari A. An optimal bit complexity randomized distributed MIS algorithm. Distributed Computing. 2011;23:331–340. doi: 10.1007/s00446-010-0121-5
17. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam Protein Families Database in 2019. Nucleic Acids Res. 2019;47:D427–D432.
18. Eddy SR. A new generation of homology search tools based on probabilistic inference. In: Genome Informatics 2009: Genome Informatics Series Vol. 23. World Scientific; 2009. p. 205–211.
19. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389
20. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176
21. Grundy WN. Homology detection via family pairwise search. J Comput Biol. 1998;5:479–491. doi: 10.1089/cmb.1998.5.479
22. Shen Z, Liu J, Zhang X, Xu R, Yu H, Cui P. Towards Out-of-Distribution Generalization: A Survey. arXiv [Preprint]. 2021. Available from: https://arxiv.org/abs/2108.13624
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009492.r001

Decision Letter 0

Feilim Mac Gabhann, Maricel G Kann

4 Nov 2021

Dear Dr. Petti,

Thank you very much for submitting your manuscript "Constructing benchmark test sets for biological sequence analysis using independent set algorithms" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. In particular, we encourage you to better explain the meaning of "independent" data and to justify the use of percent identity (PID), as opposed to an estimated phylogeny, to define the split into training and test data.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Maricel G Kann

Associate Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This article is a model of clarity in presentation, and I have very few optional suggestions for improvement.

p3, line 71. You may have views about reducing redundancy in the training set, too. It might be interesting to state them.

p5, line 101. The present article presents a complete and unified investigation. In several of your algorithms, however, one could retain some randomness after ordering the candidate lists not by random permutation, but by degree class (small to large) and then permuting randomly within each class of degrees. The effect would be to prioritize vertices of lower degree in the test and training sets, i.e., to ensure inclusion of outliers wherever possible. I suggest future examination of this variant of your algorithms.

p.9, line 135. The article makes a good case for the virtues of randomness, so the previous comment is intended entirely as a suggestion.

p.10, Figure 1. The X-axis is unlabeled. The legend states that it is "Fraction of families successfully split", but adding the axis label would help the top label "Performance..." (which can be retained). I admire the authors' ingenuity in introducing several dimensions of information into the plot.

p.14, Figure 4. PID = percent identity. I could not find the full acronym, which should be explicit.

p.14, line 214. It would be interesting to know if the ordering by degree changed the difficulty of the search.

p.18, line 257. Who benchmarks the benchmarks? The observation that the benchmark changes the difficulty of the test search (and possibly, therefore, the robustness of training) but preserves the relative performance of the algorithms is very reassuring.

Reviewer #2: This study presents three new methods, INDEPENDENT SELECTION, BLUE, and COBALT, that enable an input set of sequences (provided in a multiple sequence alignment) to be split into three sets: the training set, the test set, and the discarded set, so that the pairwise identity (PID) between any sequence in the training set and any sequence in the test set is below a user-provided threshold, and the PID between any two test sequences is below another user-provided threshold. The motivation for this approach is the idea that the training data and the test data need to be independent of each other, which the authors note is difficult due to evolution (i.e., all sequences are related to each other because of sharing a common ancestor). These new methods are designed to improve on an earlier algorithm for the same problem, called "CLUSTER", which also aims to achieve this, but these new approaches employ randomness to produce solutions to this problem that might give better downstream results.

There are several parts to the evaluation. First, they examine how many PFAM families are successfully split to produce a training set with at least 10 sequences and a test set with at least 2 sequences. The second evaluation examines the consequences of using these splits of PFAM families for a specific problem (detecting whether a sequence is locally homologous to a given family). Finally, they also evaluate running time. Putting running time aside, they note that BLUE and COBALT are strictly better than CLUSTER with respect to the first two tests (how many families are split and how accurate the subsequent bioinformatics test is using this split). They also note that INDEPENDENT SELECTION is more accurate than BLUE and COBALT, but since INDEPENDENT SELECTION is successful at splitting a much smaller number of families, this is not that important. The running time comparison shows BLUE is more computationally expensive than the remaining methods.

All this is fine. The description of the methods they developed is easy to follow and natural (i.e., they are using straightforward ideas). The writing is good (though there are a few places that could be improved, given below). The results support the claims, generally.

However, I have two questions about the whole approach that I think would merit some discussion. The first question has to do with whether using PID to define the test and training data really gets at "independence", which the authors have justified by noting that two sequence datasets can be non-independent due to shared evolutionary history. Therefore, if the test data form a small clade in the evolutionary tree, separated by a fairly long branch from the training data, they would have perhaps good separation in terms of PID but would not be independent. Wouldn't a better approach be to construct a phylogeny based on the alignment, and then extract two clades from the phylogeny?

The second question has to do with other aspects of defining training data and test data. An important consideration for training data is that it should be representative of the test data. It seems to me that enforcing a large PID separation between test and training data in a sense leads to the training data being very different from the test data. This may result in machine learning methods not generalizing well to the test data, since the test data are so different. This hypothesis is consistent with the results shown in the study, since their new methods end up producing splits into test and training data that have *smaller* gaps (in terms of PID) than the original CLUSTER method, and this is used to explain why the new methods have better accuracy. Wouldn't this also potentially suggest that relaxing the required gap between testing and training data even more might further improve accuracy in the subsequent bioinformatics analyses?

In general, therefore, the main questions to the authors are: (1) Is PID a very good proxy for evolutionary independence, and why not instead use an estimated phylogeny to define the split into training and test data? (2) Can the authors discuss the competing objectives in defining the training and test datasets, and characterize the different approaches in terms of these competing objectives?

Finally, there are a few places where the writing could be improved. Some of them are really minor (low-level writing issues), but others have to do with clarity.

1. What are "deep alignments"? (See lines 54 and 59)

2. The problem formulation (beginning of "Results") clearly depends on parameters p and q. Their selection should be discussed, and the problem formulation should note this dependency. Also, since they will remove sequences, the problem formulation is inadequate: they need to note something about removing sequences (perhaps not too many). Then also perhaps something about choosing between alternative splittings of the input: what are the criteria?

3. The choice of what set becomes the training set and which one becomes the test set is not really discussed, beyond saying (lines 34 and 35) "one cluster (usually the largest one) becomes the training set". Is that wise? Isn't there a possibility that the larger one might be the easier or the harder one? Wouldn't picking the set that has the largest amount of heterogeneity provide a better training set?

4. Lines 189-195. This is where the evaluation process is described. It would merit an additional sentence, e.g., "Here we test the ability of HMMER to detect whether a sequence contains a substring that is homologous to a family in PFAM" or perhaps something else? That way the "true negatives" and "true positives" are understood.

Low-level Writing:

• Line 67. "Given set" -> "Given a set".

• Line 143. Unless British English conventions are used, add a comma after "i.e."

• Lines 144-145. By definition, a connected component is a maximal subset that is connected, therefore by definition there cannot be any edge between two connected components. This sentence should be rephrased. For example, replace "partitioned into connected components, such that" by "partitioned into connected components; by definition,"

Reviewer #3: This paper is extremely clear and well-written (mostly, see below). It concerns getting subsets of sequences that are not too closely related to each other. This may be of great practical importance for learning models from training data, and for benchmark assessments.

My main comment is that the motivation, and meaning of "independent" data, needs better explanation. The usual idea of machine learning is that same-class objects are nearby in some high-dimensional space, for some definition of nearby. Does that make them non-independent?

For example, a previous paper ("Benchmarking the next generation of homology inference tools" Saripella et al. 2016) constructed a benchmark "aiming to ensure... homologies at different evolutionary distance". Perhaps that is more representative/realistic, than ensuring distant homologies only?

In the present paper, the final Discussion paragraph states that homology search is an "out of distribution" task, but that is not obviously true. For example, if we learn from mammal sequences then try to find bacterial sequences, that is indeed out-of-distribution, but why can't we learn from diverse sequences and thereby be "within distribution"?

# More minor points:

"Out of distribution" sounds like "transfer learning": may or may not be worth mentioning.

The "Cluster" method uses one (largest) connected component as the training set. This seems obviously pessimal, because it tends to minimize the diversity of the training set. Could this be why the Cluster benchmark is harder?

Not critical, but it would be interesting to see the runtime breakdown into the two steps.

The homology benchmark has a severe and easily-fixable problem: the negative sequences are shuffled thus lack low-complexity/simple repeats, which are a significant issue in practice. Easy fix: use reversed instead of shuffled sequences. I realize this is somewhat tangential to the paper's main point.

It would be helpful to briefly summarize what is known about the "well-studied" graph problem of independent sets. Maybe mention maximal versus maximum. Finding a maximum set (or even approximating it?) is theoretically hard. Are randomized algorithms known to be advantageous?

Reviewer #4: Description: This paper addresses an important but often ignored problem with protein training and test sets, namely that, when sequences are randomly sampled to each set, the training set often contains nearly the same sequences as the test set. Consequently, protein domain models that are trained on the training set may appear to perform better on the test set than models that would be trained on sequences drawn independently from the underlying distribution. The authors previously addressed this concern using a clustering approach, termed the Cluster algorithm in their paper, that splits the input sequences into test and training sets that share less than a specified percentage of sequence identity. However, they found that this algorithm often fails to split a family due to ‘bridging sequences’ present in the input set. To improve upon current methods, in this study the authors develop two new algorithms, termed Blue and Cobalt, that are derived from “independent set” algorithms in graph theory. The Blue and Cobalt algorithms are better able to split the input set into test and training sets by identifying dissimilar clusters. Blue successfully splits more families than Cobalt, which, however, is computationally more efficient. The authors demonstrate the advantage of these new algorithms by applying them to a large number of Pfam alignments.

Comments: The Blue and Cobalt programs appear to be useful tools for benchmarking search programs and, in particular, are an improvement over the Cluster algorithm, as the authors illustrate using Pfam MSAs. However, this study raises some fundamental questions regarding training and test sets that should at least be discussed.

Comment 1: If an input set is split into two dissimilar sets, it seems likely that these sets will correspond to distinct functionally divergent subgroups. If so, then a protein model obtained using the training set will fail to correspond to the test set: i.e., one will be comparing apples and oranges. In that case a search program might be justified in failing to detect a test sequence given that it belongs to a different subgroup. Of course, what the authors appear to want is a model of the superfamily features upon which one could also superimpose a mixture of the family and subfamily models within that superfamily. Hence, to detect a distant member of the superfamily that corresponds to a subgroup or an orphan sequence absent from the training set, one should perhaps only model the superfamily features. This is often done by purging closely related sequences from the input set so as to retain only the most highly conserved residue positions as well as the more subtle residue propensities at other positions corresponding to the characteristics and the structural core of the superfamily. Alternatively, one could retain all sequences, but down weight them for sequence redundancy rather than purging the set.

Comment 2: Although purging the input set of all but the most diverse sequences seems like a good way to avoid training biases, this can lead to undesirable side effects. For example, input sets often include pseudogene products and sequences corresponding to DNA open reading frames containing frame shifts or splicing artifacts. Since such sequences will lack similarity to functional sequences, they will be overrepresented in the purged set. Hence, focusing too much on the most diverse sequences can add a significant amount of misleading noise to the training set. Instead, to better capture the key features of both the superfamily and of each subgroup within the superfamily, it might make more sense to first group sequences into closely-related clusters (each, to some degree, corresponding to a functionally divergent subgroup) and then select, as a representative for each cluster, that sequence sharing the greatest similarity, on average, with the other sequences in the cluster.
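
In rough pseudocode, the cluster-representative selection suggested here would look something like the following (illustrative only; identity() is again a placeholder for a pairwise percent-identity routine):

    def cluster_representatives(clusters, identity):
        # For each cluster, pick the member with the highest mean pairwise
        # identity to the other members (a medoid-like representative).
        reps = []
        for members in clusters:
            if len(members) == 1:
                reps.append(members[0])
                continue
            best = max(members,
                       key=lambda a: sum(identity(a, b)
                                         for b in members if b != a)
                                     / (len(members) - 1))
            reps.append(best)
        return reps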

Comment 3: Some protein families are very highly conserved across distantly related organisms. For example, histone H4 from pea and cow differ at only 2 out of the 102 positions. This makes it easy, of course, to distinguish these proteins from unrelated proteins. However, distinguishing such highly conserved protein family members from other related proteins may not be trivial. I assume, of course, that this is not the problem that the authors are seeking to address, even though it too is an important problem when seeking to assign specific functions to hypothetical proteins. I suggest that the authors describe very precisely the nature of the problem that their new programs aim to address. This appears to be the problem of identifying very distant homologs to the more typical members of a protein superfamily. If so, then the families that fail to be split either by the Cluster algorithm or by the Blue/Cobalt algorithms may be ones that are irrelevant to the problem being addressed because these correspond to highly conserved Pfam families.

Comment 4: I suggest that the authors may get a better perspective on these questions by generating simulated sequences from a known distribution (e.g., based on an HMM) and then adding additional sequences by simulating their evolution, which would model the lack of statistical independence that their programs aim to address.
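
Even a crude simulation along these lines would be informative; schematically (a star phylogeny with uniform point substitutions, far simpler than the HMM emission plus substitution-model simulation suggested above; all parameters are arbitrary):

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def mutate(seq, n_subs, rng):
        # Apply n_subs random point substitutions: a crude stand-in for
        # evolution along one branch (no rate matrix, no indels).
        chars = list(seq)
        for _ in range(n_subs):
            i = rng.randrange(len(chars))
            chars[i] = rng.choice(AMINO_ACIDS)
        return "".join(chars)

    def simulate_family(ancestor, n_seqs, subs_per_branch, seed=0):
        # Independently evolve copies of a common ancestor (star tree),
        # so the resulting sequences are correlated rather than i.i.d.
        rng = random.Random(seed)
        return [mutate(ancestor, subs_per_branch, rng) for _ in range(n_seqs)]

    rng = random.Random(1)
    ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(120))
    family = simulate_family(ancestor, n_seqs=50, subs_per_branch=30)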

Minor comments:

1. In Figure 3, what do the gray bars above the colored bars that fade out at their top ends indicate?

2. Page 15 line 209: “brought closer the ROC curve” --> “brought closer to the ROC curve”?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009492.r003

Decision Letter 1

Feilim Mac Gabhann, Maricel G Kann

26 Jan 2022

Dear Dr. Petti,

Thank you very much for submitting your manuscript "Constructing benchmark test sets for biological sequence analysis using independent set algorithms" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Maricel G Kann

Associate Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The revision addresses the questions I raised, and I appreciate the discussion in the response to review in particular. This has greatly improved the paper.

Reviewer #3: Sorry if I wasn't clear enough last time. I still feel that much of the writing about "independent" data is confusing, ambiguous, and misleading. I think there are two different reasons for this.

Reason (1): Independence is relative, not absolute. According to the mathematical definition, two things A and B are independent when:

P(A|B) = P(A).

But this depends on the "background" probability distribution, P(A). Consider these two background probability distributions for sequences:

(i) Random i.i.d. sequences, with some length distribution.

(ii) Randomly pick from a set of real biological sequences, which are related by evolution.

It is possible for sequences to be independent relative to (ii) but not (i).
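
A toy example of this point (illustrative only; the pool sequences are made up):

    import random

    # A small pool of closely related homologs: background distribution (ii).
    pool = ["MKVLITGAGSGLG", "MKVLVTGAGSGLG", "MKILITGASSGLG"]

    rng = random.Random(0)
    a, b = rng.choice(pool), rng.choice(pool)
    ident = sum(x == y for x, y in zip(a, b)) / len(a)
    # Relative to background (ii) -- uniform draws from this fixed pool --
    # a and b are independent samples, yet they are nearly identical because
    # the pool itself consists of evolutionarily related sequences. Relative
    # to background (i) -- i.i.d. residues -- the same similarity would
    # indicate strong statistical dependence.
    print(f"pairwise identity of two 'independent' draws: {ident:.0%}")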

Reason (2): While the title and abstract do not specify "homology search" (and indeed the methods may be useful for other things, like protein localization prediction), the paper focuses on the aim of homology search. This is confusing, because "homology" means "related by evolution", which is the same as the non-independence "nuisance" that they are trying to eliminate!

In short, I feel that the abstract and introduction should be carefully rewritten to clarify these issues. Surely the authors understand all this, and it's well-described in the final paragraph of the Discussion. But the Abstract and Introduction are too unclear.

# MINOR

The paper should at least mention the issue of low-complexity/simple repeats, and that it might affect the relative performances of the homology search methods. (I suspect DIAMOND may have an advantage in avoiding simple repeats.)

Page 20: should "the known sequence x" be "the known sequences x"?

Page 20 last line: x and y mixed up?

Reviewer #4: nothing to upload

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009492.r005

Decision Letter 2

Feilim Mac Gabhann, Maricel G Kann

10 Feb 2022

Dear Dr. Petti,

We are pleased to inform you that your manuscript 'Constructing benchmark test sets for biological sequence analysis using independent set algorithms' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Maricel G Kann

Associate Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009492.r006

Acceptance letter

Feilim Mac Gabhann, Maricel G Kann

24 Feb 2022

PCOMPBIOL-D-21-01725R2

Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Dear Dr Eddy,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Characteristics of Pfam seed families successfully split.

    Each marker represents a family in Pfam. The connectivity of a sequence is the fraction of other sequences in the seed family with at least 25% pairwise identity. Families successfully split into a training set of size at least 10 and a test set of size at least 2 are marked by a cyan circle, whereas families that were not split are marked by a red diamond. In (B) and (D) the cyan circle represents at least one successful split among 40 independent runs.

    (TIF)

    S2 Fig. Size of training and test sets produced by each algorithm on seed families.

    The two-dimensional normalized histograms show the distribution of training and test set sizes produced by the algorithms, among results with at least 10 training sequences and at least 2 test sequences. In each plot, the x- and y-coordinates of the green circle represent the median training and median test set sizes, respectively. The white X is placed at the median training and test set sizes among the 2363 families that were successfully split by Blue, Cobalt, and Cluster.

    (TIF)

    S3 Fig. Size of training and test sets produced by each algorithm on full families.

    The two-dimensional normalized histograms show the distribution of training and test set sizes produced by the algorithms, among results with at least 400 training sequences and at least 20 test sequences. In each plot, the x- and y-coordinates of the green circle represent the median training and median test set sizes, respectively. The white X is placed at the median training and test set sizes among the 1070 families that were successfully split by Blue, Cobalt, and Cluster.

    (EPS)

    Attachment

    Submitted filename: response-to-reviews.pdf

    Attachment

    Submitted filename: response-to-reviews-v2_2_8_22.docx

    Data Availability Statement

    The splitting algorithms are implemented in C and available here: https://github.com/EddyRivasLab/hmmer/tree/develop. To run the algorithms, the following version of EASEL is needed: https://github.com/EddyRivasLab/easel/tree/develop. The code used to generate the figures in this paper is available at https://github.com/spetti/split_for_benchmarks.

