Abstract
Random peptide libraries that cover large search spaces are often used for the discovery of new binders, even when the target is unknown. To ensure an accurate population representation, there is a tendency to use large libraries. However, parameters such as the synthesis scale, the number of library members, the sequence deconvolution and peptide structure elucidation, are challenging when increasing the library size. To tackle these challenges, we propose an algorithm-supported approach to peptide library design based on molecular mass and amino acid diversity. The aim is to simplify the tedious permutation identification in complex mixtures, when mass spectrometry is used, by avoiding mass redundancy. For this purpose, we applied multi (two- and three-)-objective genetic algorithms to discriminate between library members based on defined parameters. The optimizations led to diverse random libraries by maximizing the number of amino acid permutations and minimizing the mass and/or sequence overlapping. The algorithm-suggested designs offer to the user a choice of appropriate compromise solutions depending on the experimental needs. This implies that diversity rather than library size is the key element when designing peptide libraries for the discovery of potential novel biologically active peptides.
Electronic supplementary material
The online version of this article (10.1186/s13321-019-0347-6) contains supplementary material, which is available to authorized users.
Keywords: Peptide libraries, One-bead-one-compound, Algorithm-supported design, Genetic algorithm, Optimization
Introduction
Small-molecule libraries are widely used in drug discovery to identify biologically active molecules [1]. Traditionally, small molecule library design is based on a known target structure or on known ligands. The last two decades have witnessed the application of docking calculations to the pharmaceutical field through the lead optimization of structure-based design of small molecules [2]. Moreover, molecular docking is the main approach in the structure-based peptidyl-drug design studies, even though peptide based drugs are less explored than the small molecule ones. However, when the targets and/or ligands are unknown or “undruggable” this virtual screening approach fails to provide useful information [3]. Therefore, one has to rely on experimental screenings of random, large numbers of compounds in order to obtain further insight into possible identification of binders or disruptors of protein-protein interactions (PPI) [4, 5].
Safer and more specific alternatives to small molecules, peptide-based drugs are emerging as a new paradigm in medicinal chemistry [6, 7]. Therefore, there is a need to develop combinatorial approaches to identify new peptide therapeutics [5]. In this context, phage display was the main approach to obtain a variety of random peptide sequences [8, 9]. Nowadays, the phage display technology is well established in the peptide-based drug discovery process [10, 11]. However, the limitation was that the library building blocks are limited to the 20 natural proteinogenic amino acids resulting in peptides with short half-lives and susceptible to proteolytic cleavage. Progress with phage display and RNA-display technologies has allowed the use of non-proteinogenic amino acids in those techniques [12], but the one-bead-one-compound (OBOC) method allows a more straightforward use of unnatural amino acids [13, 14]. It allows the introduction of stereochemical variability at the -carbon, head-to-tail cyclization, disulfide bridges, etc. and their combinations without complex genetic manipulations [15]. Therefore, OBOC peptide libraries, have opened up the path to a larger search space for discovery of peptide-based drugs [15, 16].
Chemical and molecular diversity of library components are key for drug discovery [17]. Combinatorial libraries are tools that allow large amount of compounds to be screened at the same time [4]. Currently, the diversity-oriented systems (DOS) for small molecules are taking over the static combinatorial approach, allowing the introduction of skeletal, structural and stereo-chemical complexity [18–20]. Similarly to DOS, one can think of introducing stereo-chemical and skeletal complexity into peptides through the use of D-stereoisomers, the combination of L- and D-ones, the retro-enantio versions and cyclization. By this means, peptide-based libraries with increased stability could be generated. However, their analysis is challenging as similarity in structure and behavior of available peptide permutations would lead to the impossibility to distinguish between sequences using today’s available screening and/or analyzing techniques such as HPLC, UPLC-MS, etc.
It is believed that larger the sample size, more accurate is the representation of a given population [21]. In these terms, random libraries are advantageous over the focused ones, as they cover a larger search space. However, parameters such as the synthesis scale to guarantee the synthesis of all possible compounds, the number of library members, the sequence deconvolution and peptide structure elucidation after screening, are challenging steps when increasing the library size [22]. Moreover, libraries may result in competition among candidates and biased hits identification due to the physico-chemical similarity of the components [21]. This is particularly the case during the sequence deconvolution process of positive hits, where the identification of single peptides is not always possible. Often, families of peptides having identical masses are identified [23]. In order to address this issue, we set out to determine whether multi-objective genetic algorithms can aid and simplify the rational design of random libraries to increase diversity and minimize redundancy. The advantage of this tool is the reduction of the number of library members in a single screening while maintaining maximal chemical and mass diversity.
Recently, artificial intelligence has been applied to a variety of chemical problems to maximize the chance of successful and rapid solving of complex issues [24–28]. In our group, evolutionary algorithms have been successfully applied to address various challenges of peptide chemistry [29–32] related to the design and identification of peptides that cross the blood-brain-barrier (BBB) [33], or that bind to the major histocompatibility complex (MHC) [34]. Recently, Cronin and co-workers used evolutionary algorithms in combination to machine learning to predict antimicrobial activity of peptide sequences [35]. Although the heuristic approach has been explored in the field of peptide design, it is mainly focused on the use of the one-objective setting where one fitness function optimizes a specific peptide property (e.g., antimicrobial, anticancer activity) [36–38]. To date, multi-objective genetic algorithms have been applied to the design of small molecule combinatorial libraries [21, 39–41] but have not been explored for random peptide libraries yet. Herein, we present the framework to solving and/or simplifying a combinatorial challenge—whether we could rationally design a random peptide library and what are the rules that underline this algorithm supported process. For this purpose, we used an evolutionary computing approach based on a multi-objective genetic algorithm (GA) [42, 43] represented in scheme 1 to address the issue of maximizing the number of diverse amino acid permutations. In this way, sequence deconvolution and peptide sequence elucidation are simplified by avoiding sequence redundancy. Ultimately, the designer is able to make an algorithm-supported choice of an appropriate compromise solution among given optimized library designs, depending on the experimental needs.
Results and discussion
The 20 gene-encoded amino acids, together with a variety of non-natural ones, constitute a versatile toolbox for combinatorial library design (Scheme 1a). A high chemical and structural diversity can be achieved by designing libraries composed of sequences of amino acids (peptides) which total number is calculated using the formula , where r is the number of positions where the variability can be introduced and m the number of amino acids per position [22]. Therefore, the number of permutations grows exponentially with the peptide length, depending also on the number of amino acids per position. If all the 20 proteinogenic amino acids are used in r positions, there are possible permutations. OBOC libraries prepared using the “split and mix” methodology result in redundant peptide mixtures consisting of permutations of amino acids having overlapping masses [23].
In our study, the introduction of complexity of the system refers to the molecular weight () diversity of the library components. In the case of peptides, made of combinations of various natural or non-natural amino acids, the diversity often corresponds to sequence diversity. Therefore, based on monoisotopic values we set out to determine an algorithm supported approach to help the random design of systems that maintain maximum mass and sequence diversity. Ultimately, the objective of this paper is to simplify the tedious and sometimes difficult chromatographic and mass spectrometry based analyses of complex mixtures of peptides that have similar properties. We chose the as the discriminant for increasing diversity and complexity of the system, consequently reducing the number of peptides in the mixture.
First, we developed a peptide mass calculator with the possibility to discriminate between masses of peptides in a given mixture, generating an output csv (comma-separated values) file containing all possible peptide permutations alongside with their masses in addition to a csv file containing the list of peptides with unique masses. In the latter, the permutations having the same amino acid composition but with varying locations within the sequence were omitted except one representative of each collision set. The same was applied to peptides having different amino acid composition but overlapping masses, with tolerance set to 1 (see Additional file 1: Fig. S7). Tolerance is an arbitrary parameter that is user defined and it can be tuned to any value, as described in the next subsection. A simplified schematic representation of the calculator output, using a small library of dipeptides is shown in Scheme 1b. The amino acids are divided in 5 different categories (color-coded) depending on their polarity and charge (Scheme 1a). If each of the colors stands for one representative amino acid and all the possible dipeptides are made, there are (r = 2, m = 5) = 25 permutations. After the calculator excludes the peptides with overlapping masses, we are left with a subset of 15 dipeptides of unique masses, as shown in Scheme 1b (right panel). In this way, we could simplify the chromatographic and mass spectrometry analysis of the mixture.
The following step was the development of the optimization tool based on the genetic algorithm (Scheme 1c) that is able to suggest several design strategies based on user’s needs and inputs (Scheme 1d). The input is the same as for the calculator, where the number of positions (r) and the amino acids () for each position have to be defined. The output is a Pareto front containing a distribution of solutions, based on how they scored during the GA selection rounds. The Pareto front is a set of non-dominated solutions chosen as optimal by the GA. Subsequently, the user makes the choice of the preferred design suggestion based on the experimental expertise and the requirements of the library (Scheme 1d). We refer to this choice as algorithm-supported decision. Moreover, we explored the possibility of a three objective algorithm to include also the sequence diversity.
Library mass analysis with the combinatorial calculator
During the initial step of peptide library planning there are two parameters that need to be determined (Fig. 1a): (1) the number of positions r and (2) the list of amino acids that can appear at i-th positions, where . These constitute our input parameters. To simplify and automatize mass calculations of all the possible permutations in a user-defined peptide library, the peptide mass calculator algorithm was developed using the Matlab scripting language. An example is a pentapeptide library where the input was: r = 5, m = 6, for = s, e, r, w, a, G (Fig. 1a). For this condition, the total number of peptide permutations given by is 7776 (Fig. 1b).
The peptide mass calculator also compares the masses of all the possible permutations and locates the ones that have unique mass. The motivation behind this was to estimate whether the analysis of all the compounds with unique masses i.e., simplified libraries in terms of number of peptides in the mixture is feasible by using chromatography coupled mass spectrometry. Hence, the mass of a peptide M is considered unique if no other peptide has the mass within the range . If there is another peptide in the mixture, having mass within this range, it is excluded from the list of permutations unique by mass, thus automatizing redundancy identification. The parameter is referred to as the tolerance of mass discrimination. It is also a user-defined parameter, which may be tuned (according to user’s needs and the resolution of the machine used to perform the mass spectrometry). Throughout this study, the condition was kept in all the calculations and permutations, unless stated otherwise. Using this configuration and the exemplary pentapeptide library from Fig. 1a, the peptide mass calculator computes 130 peptides of unique mass (Fig. 1b).
The calculator output provides a csv (comma-separated value) type file containing the full list of permutations with the corresponding values, in the form of: (1) the average mass, (2) the monoisotopic mass, (3) the singly charged and (4) the doubly charged expected ions for mass spectrometry analysis (Fig. 1c). The algorithm also provides the full list of unique peptide masses (for simplicity only a screenshot is shown in the Fig. 1c).
Therefore, our algorithm gives the possibility to the user to explore all possible permutations as well as the ones baring unique masses, representative of the population. Here, peptide diversity is achieved solely through the introduction of functional group (side chain) diversity and leads to peptide collections with variable mass diversity. This tool is the basis for the optimization process used for the algorithm-supported library design described below.
Library design optimization with the genetic algorithm approach
Next, we developed an optimization decision support system based on the genetic algorithm (GA) approach. GAs are iterative heuristic processes of computational optimization that is based on selection, cross-over and mutation of genes, analogous to the natural process of evolution where the best genes of individuals are preserved and passed to the next generation with the expectation that the fittest parents will give even fitter offspring. In computer science, GAs are mimicking this process by trial and error, gradually finding the best solution or solutions to a given problem and ending only if one of the stopping criteria is met (Fig. 1c): (1) the fitness function of an individual achieved the maximum level or a predefined threshold level, i.e., the algorithm has found the best solution or a solution that is good enough; (2) after a certain number of generations, the fitness function of best solutions is not improved, i.e., the algorithm cannot find another solution that is better than the one it has already found. In each iteration (generation) of a GA, a group (population) of candidate solutions (individuals) to a given optimization problem are considered simultaneously. Each solution is defined by parameters encoded in genes and its quality is quantified by fitness functions. The values of fitness functions are numerical indicators used to compare solutions and find the best ones.
In this paper, the second variant of the Non-dominated Solutions Genetic Algorithm (NSGA-II) was used to run peptide library optimizations. The goal was to identify optimal library designs that yield the greatest number of peptides with unique masses. The motivation was to find library designs with feasible analysis of all its components by chromatography coupled mass spectrometry. The design of a library is defined by the number of positions r and the variability (a number of possible amino acids) in each position . The variability is user defined and consists of a set of possible amino acids, which may vary for each position. The implemented algorithm transforms each set of possible amino acids into a bit-string, where the length of the bit-string is equal to the length of the set and each bit (1 or 0) represents the inclusion (1) or exclusion (0) of a given amino acid. That way, the algorithm is searching for an optimal subset of amino acids. The search is guided by two fitness functions that need to be maximized: (1) the total number of peptides for a given variability and (2) the number of peptides unique by mass. Both fitness functions are computed for each solution examined in this iterative search process by using the peptide mass calculator.
To illustrate the advantage of the optimization output over the single, user-predetermined design, we used the same input shown in Fig. 1a. In contrast to the peptide calculator, the optimization offers a wide range of random peptide library designs simplified to satisfy the maximum mass diversity condition. This can be seen from the Pareto front that contains all the best solutions (red dots) and the distribution of all the remaining solutions from the final population (blue dots) in Fig. 2. We focused our analysis to the 70% to 100% diversity region and reported on the three best solutions encircled in the zoomed Pareto front graph being BS 1 (100%), BS 2 (94%) and BS 3 (92%). Our implementation of the algorithm also allows the user to pick any point of interest and obtain a detailed analysis containing the output of the peptide mass calculator accompanied by the corresponding sequence logo graphs. All the sequence logos throughout the text are presented in the conventional N- to C-terminus fashion.
The library design suggestion for BS 1 consists of 32 peptides having unique mass. The algorithm informs that this specific library can be obtained by simplifying the input from s, e, r, w, a, G in each position xi to {a, G} for , {w, a} for , {r, G} for , {e, r} for and {s, r} for (Fig. 2). Similar design suggestions are shown in Fig. 2 for BS 2 and BS 3, where the number of peptides having unique masses increases as well as the mass overlapping. Depending on users requirements, in terms of the number of library members and the desired chemical diversity, the design strategy can shift towards the increasing number of possible peptide permutations and introduction of limited or extended mass overlapping. From the same optimization, BS 50%, BS 30% and BS 10% points were also analyzed. The zoom of the Pareto front in the 10% to 50% mass diversity range with the corresponding sequence logos can be found in supplementary information (Additional file 1: Fig. S1). Having increased the number of amino acids in each position lead to the greater number of peptides unique by mass in BS 50%, BS 30% and BS 10%. However, the total number of peptides increased at a greater rate, thus reducing the overall mass diversity ratio of these libraries.
The size of search space is the number of all possible solutions for the binary encoded genome calculated as (Fig. 3a). In our case, the complexity of the optimization task is determined by the search space size that is exponentially dependent on the product of m and r. In Fig. 3, we showed how the search space changes from to for a constant value of m () when increasing r from 5 to 10, respectively. The time needed to complete the optimization task is influenced by its complexity (i.e. balance of m and r) and the available computer resources. For peptide lengths from 5 to 10-residues long with varying values of m the algorithm completed the optimization tasks in a reasonable time frame (hours to days) with a standard PC configuration. In contrast to the search space complexity for the given set of libraries (Fig. 3b), all the best solutions (BS 1) have only 32 peptides unique by mass, regardless of the increasing number of positions r (Fig. 3a). A set of sequence logo graphs given in Fig. 3c corresponding to 100% mass diversity shows a set of designs available to the user. Their similarity proves that having an algorithm-supported design strategy still requires carefully chosen inputs from the expert user.
To illustrate the utility of the two-objective optimization, we used the example of an all D-heptapeptide with r = 5 and m = 7, for ; having two fixed positions and . The Pareto front for this optimization can be seen in Fig. 4a. When zoomed into the 70–100% mass diversity region, several design solutions appear. We chose to analyze five points being BS 1 (100%), BS 2 (98%), BS 3 (86%), BS 4 (77%) and BS 6 (70%) and show their sequence logos (Fig. 4b and c) to allow better visualization of the design suggestions and possible synthetic challenges. In our region of interest, the algorithm suggests possible designs to obtain maximally diverse random peptide libraries. Interestingly, BS 2 shows a high mass diversity of 98% and only one peptide with overlapping mass. With 47 peptides unique by mass it offers a design possibility with 33% higher total number of peptides and only 2% mass diversity reduction when compared to BS 1. Thus, this would be our design of choice to attempt the experimental validation. The complete list of peptides for this solution can be found in the Additional file 2: Table S1.
To show the versatility of the optimization process, we presented several other examples of library designs: (a) r = 5, m = 6 for ; in Additional file 1: Fig. S2, (b) r = 7, m = 7 for ; in Additional file 1: Fig. S3, (c) r = 5, m = 10 for in Additional file 1: Fig. S4 and (d) r = 6, m = 10 for ; in Additional file 1: Fig. S5.
Three-objective GA to maximize the complexity through sequence diversity
Among the permutations with overlapping masses taken from the examples examined in the previous section (Fig. 4 and Additional file 1: Figs. S1–S5), some entries exhibited different amino acid composition (Additional file 1: Fig. S7). In order to preserve these solutions in the library design, we introduced a third fitness function to distinguish the permutations by their composition. Consequently, the optimization implemented for combinatorial peptide library design computed the number of peptides of unique amino acid composition and maximized their sequence diversity. In addition, the introduction of sequence diversity as a fitness function enables the 3-objective GA to find new solutions that may have been neglected in the 2-objective setting because of lower mass diversity. An example is a heptapeptide library from the , , optimization where two permutations with different sequence, aGpGery and rGpGisy, show monoisotopic masses of 748.3505 and 748.3869, respectively (Additional file 1: Fig. S7). In this particular case, the monoisotopic mass of ‘ae’ dipeptide (218.116) overlaps with the mass of the ‘is’ dipeptide (218.079). These two peptides would have been excluded in the previous version of the algorithm with , but now they are included because of the increased sequence diversity within the library.
Using the 3-objective GA, the optimization results are displayed in a three-dimensional Pareto front, from which the 2D projections of both amino acid diversity and mass diversity can be extracted and analysed separately (Additional file 1: Fig. S6). In addition to the output obtained with the 2-objective GA, the output of the 3-objective optimization contains a separate csv file listing all the permutations unique by sequence. In this setting, we performed the optimization for two libraries: r = 5, m = 6 for ; in Fig. 5a–c and r = 5, m = 7 for ; , in Fig. 5d–f. We analyzed the output of these examples showing the difference between the sequence and mass diversity of the best solutions, followed by the analysis of the 70% to 100% diversity region of all solutions and finally presenting the sequence logo graphs of two prominent solutions.
The diversity analysis of the libraries (Fig. 5a, d) consists of a parallel representation of sequence and mass diversity distributions on the y-axis for all the best solutions, where each has a different total number of permutations on the x-axis. In these figures, three distinct regions can be observed:
for smaller library sizes, composed of 200 or less permutations, the optimization outputs of mass and sequence diversity overlap;
for medium sized libraries, containing a 1000 or more permutations, the sequence diversity exhibits a faster growth than the mass diversity;
for large libraries with more than 2000 permutations, both diversities saturate and exhibit almost no growth with the increase of library size.
The behavior of the optimization results is summarized in the optimal solutions graph (Additional file 1: Fig. S6d), where two areas of interest are labeled: the overlapping zone (1) where the diversity by sequence and by mass is very similar, and the diversity zone (2) where the diversity by sequence is greater than the diversity by mass. This underlines the overall advantage of the introduction of the third objective in the optimization process that offers a greater possibility to fine-tune the design and fulfill the users’ requirements. As expected, the number of peptides unique by sequence is higher than the number of peptides unique by mass, but both numbers have a theoretical maximum, which is considerably lower than the total number of permutations within the library.
Figure 5b, e present the 2D graphs of the best solutions (green dots) alongside all the solutions (blue dots) from the optimization representing mass diversity on the x-axis and sequence diversity on the y-axis, relative to the total number of permutations. This representation includes an overview of several other design options available that might be of interest to the user and could lead to highly diverse libraries. We focused our interest to the 70% to 100% diversity region and chose only solutions of sequence diversity higher than 95% even if their mass diversity was lower. Two prominent designs are marked and their sequence logo graphs presented in Fig. 5c, f. Both the examples show sequence diversity of 98% while their diversity by mass is 89% and 81%, respectively. Considering that the sequence diversity is a stronger indicator of chemical diversity within the library, it is evident that these two solutions cover a wide range of chemical properties and possibly offer alternative design choices worth of consideration.
In the current system, if the user is interested in making a hydrophobic library, the input should consist of preferentially hydrophobic amino acids. The algorithm-assisted, three-objective optimization offers different possibilities of peptide designs, shown for the high diversity region (70–100%) as well as for the 10–50% diversity region, in Additional file 1: Figs. S8 and S9.
Comparison of two- and three-objective settings and the effect of
Although the computational cost for the three-objective GA optimization is higher, the benefit of increasing the sequence diversity is visible when comparing the output results, i.e., the best solutions of the 3- and 2-GA optimizations. Figure 6 presents this comparison in the 70% to 100% diversity region for two different inputs: (a) r = 5, m = 6 and ; and (b) r = 5, m = 7 and ; , . It can be noticed that the best solutions from the 3-objective optimization (red crosses) are more numerous than the 2-objective ones (blue dots), indicating the coverage of a larger search space and greater design choices. However, the number of solutions with higher diversity is closely related to the chosen value of the tolerance of mass discrimination . As previously mentioned, we set this parameter to to discriminate the permutations by mass. Therefore, should a user require lower or higher tolerance values, the ratio between the number of permutations unique by sequence and unique by mass would change accordingly.
As expected, some of the proposed designs overlap in both optimizations as shown in Fig. 6a, b. This suggests that the choice of the optimization method will depend solely on the users’ requirements. When the mass diversity design is sufficient, the 2-objective optimization will be the method of choice. Should a user require additional design suggestions based on sequence diversity, it can opt for the more costly but more informative 3-objective method.
Next, we explored the effect of varying the tolerance input to values of and for the optimization described in Fig. 4 using both, the two- (Additional file 1: Figs. S10–S14) and the three-objective (Figs. 7, 8 and Additional file 1: S14) settings. When changing to 2, 5, 0.5, 0.1, 0.01 and 0.001, the distribution of best solutions (Fig. 7) and the number of permutations for related mass diversities are affected (Fig. 8). The impact of varying the from 0.001 to 2.5 on the number of best solutions is visible in the optimization results where a smaller tolerance window yields a lower number of best solutions. An example is the high diversity region (sequence diversity above 90%), where the number of best solutions is 10 for T = 0.5, 9 for T = 0.1, 5 for T = 0.01 and only 4 for T = 0.001 (Fig. 7). In addition, when lowering to values close to zero such as 0.001 and 0.01, the algorithm behaves as sequence diversity was the measure of library diversity, i.e. it works according to the 3rd objective. This results in mass and sequence diversity being linearly correlated as seen in Fig. 7d.
In agreement with the spread of pareto fronts shown in Fig. 8, the number of permutations increases for decreasing values of and hence, the pareto front shifts to the right accordingly. By comparing best solutions having 85()% mass diversity (Fig. 8), it can be noticed that the number of permutations increases from 48 (T = 2.5) to 72 (T = 1), 108 (T = 0.5 and T = 0.1), 144 (T = 0.01) and 162 (T = 0.001). For values below 1, i.e., 0.5, 0.1, 0.01 and 0.001, no differences are observed in terms of number of library components for BS1 (100% diversity). On the other hand, when increasing the value to 1 and 2.5, BS 1 (100% diversity) shows a decreasing number of library components from 36 (BS1, ) to 32 (BS1, ).
Furthermore, in the 2-objective setting, by lowering the tolerance window, the number of permutations with unique mass increases, to take into account those sequences that have different amino acid composition but mass difference lower than . As expected, there is little difference in optimization results between values of 0.5 and 0.1. In the high diversity region BS1, BS2, BS3 and BS5 match while BS 4 shows one more peptide unique by mass for . When looking at , the differences are more pronounced in terms of mass diversity and number of library components seen for BS 2, BS 3, BS 4 and BS 5 when compared to values of 0.5 and 0.1 (Additional file 1: Figs. S10–S13).
Conclusions
When dealing with collections of structurally similar compounds, such as peptides that have the same amino acid composition but differential positioning of residues in the sequence (permutations), similar or identical mass, and similar physico-chemical properties, it is challenging to discriminate single permutations with high accuracy. Therefore, diversity rather than library size is the key element when designing random peptide libraries for the discovery of novel biologically active peptides.
In this paper, we have presented an algorithm-supported methodology for the design of random peptide libraries. Basing the methodology on the multi-objective genetic algorithms we achieved libraries with maximal number of peptides that seek to maximal mass and/or sequence diversity. In the two-objective setting, where the goal was mass diversity, the tolerance parameter allows the library designer to define how different two peptides should be in terms of their mass. Moreover, sequence diversity was achieved in the three-objective setting that offered additional library solutions whit similar mass but differential composition of amino acids. This system could be extended to several other fitness functions and their combinations. A possible future multi-objective GA could take into account the charge or the hydrophobicity/philicity of the library components.
Intelligent systems that operate under controlled conditions allow us to explore huge search spaces and offer a large number of possible solutions. It is up to the user to choose the solution of interest. Throughout a series of examples we highlighted the advantage of having numerous design suggestions before attempting any synthesis of complex mixtures. In this way, we could rationally design a library or different pools of smaller libraries—having desired properties and amino acid compositions—based on informed and careful user choices. Thus, this paper demonstrated the necessity of interlinking the advanced computing capabilities of genetic algorithms and the design of peptide libraries. The size of the search space makes heuristic algorithms the method of choice. In fact, the range of possible solutions is difficult to access to an expert user even for smaller scale inputs.
Materials and methods
The definition of input parameters is presented in Algorithm 1, where r is the number of positions, counter is the number of amino acids at each position and full list is the full amino acid list:
The core NSGA-II algorithm is used from the Matlab built-in script gamultiobj. Its default parameters are modified as follows:
‘PopulationSize’ is set to 500,
‘ParetoFraction’ is set to 0.2,
‘PopulationType’ is set to ‘bitstring’.
The binary encoded genome for NSGA-II is created as a bit-string of length:
1 |
The methodology used to compute the three fitness functions (, and ) for each Genome is presented graphically in Fig. 9.
Algorithm 2 (readGenome) is used to transform the genome of a solution obtained by NSGA-II into a algorithm selected list of amino acids (selected_list):
The analysis of the library solution (selected_list) is followed by the computation of the permutations which constitute the Library. Algorithm permutations uses Matlab built-in function ndgrid to make the computation.
Algorithm (getMass) is used to compute the mass of every permutation within the Library using the following equation:
2 |
where M represents the mass of a peptide, represents the mass of amino acid a at position i within the peptide, and is the mass of a molecule of water. Slightly modifying the Eq. 2, the algorithm computes the average mass, the monoisotopic mass, the singly charged or the doubly charged mass of peptides.
The analysis of the Library is followed by the exclusion of peptides of similar mass implemented in Algorithm 3:
where and are masses of permutations at positions i and j within the Library, while is the tolerance of mass discrimination. If the absolute value of mass difference between permutations at positions i and j is less than , then the permutation at position j is removed from the Library list. Algorithm Exclude similar mass repeats this process until all the permutations are compared and only the ones with unique mass remain in the list (Unique by mass).
The analysis of the Library is also followed by the exclusion of peptides of equal sequence implemented in Algorithm 4:
The algorithm Exclude equal sequence uses the algorithm getMass to speed up the search because peptides of similar mass are candidates for peptides of equal sequence. The candidate permutations within the Library need to be transformed into a list of characters and sorted alphabetically. If two peptides have equal list of sorted characters, one of them is excluded from the Library list. Algorithm Exclude equal sequence repeats this process until all the permutations are compared and only the ones with unique sequence remain in the list (Unique by sequence).
Additional files
Authors' contributions
DK conceived and designed the project. GM designed and implemented the algorithms. GM, DK, TT and EG analyzed the data. GM and DK wrote the paper. EG did a critical revision of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank the ERASMUS+ staff mobility program 2017 which founded the mobility for GM and thus allowed the first contact between IRB, Barcelona and UNIRI, Rijeka. We would also like to thank Prof. Tihana Galinac Grbac and EVOSOFT Project (UIP-2014-09-7945) for their valuable support.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All the data on which the conclusions of the manuscript rely are thoroughly presented throughout the manuscript. The algorithms needed to reproduce the results are stated in “Materials and methods” section.
Funding
This study was funded by the Ministry of Economy and Competitiveness (MINECO) and the European Fund for Regional Development FEDER (BIO 2016-75327-R, RTC-2015-4336-1, PCIN-2015-052); the Generalitat de Catalunya (XRB and 2014SGR-521); RecerCaixa Foundation and Fundación Ramón Areces. The authors would also like to thank the FP7 Marie Curie Actions (COFUND program; grant agreement no.IRBPostPro2.0 600404) for funding. IRB Barcelona is the recipient of a Severo Ochoa Award of Excellence from MINECO (Government of Spain). This work was supported in part by University of Rijeka under the project number 18.10.2.1.01.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
- GA
genetic algorithm
- NSGA-II
non-dominated solutions genetic algorithm II
- csv
comma-separated values
- BS
best solution
- MW
molecular weight
- OBOC
one-bead-one-compound
- DOS
diversity-oriented systems
- HPLC
high performance liquid chromatography
- UPLC-MS
ultra-high performance liquid chromatography-mass spectrometry
- BBB
blood-brain-barrier
- MHC
major histocompatibility complex
References
- 1.Hajduk P, Galloway WR, Spring DR. A question of library design. Nature. 2011;480:42–43. doi: 10.1038/470042a. [DOI] [PubMed] [Google Scholar]
- 2.Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov. 2004;3(11):935–949. doi: 10.1038/nrd1549. [DOI] [PubMed] [Google Scholar]
- 3.Dandapani S, Marcaurelle LA. Accessing new chemical space for “undruggable” targets. Nat Chem Biol. 2010;6(12):861–863. doi: 10.1038/nchembio.479. [DOI] [PubMed] [Google Scholar]
- 4.Kodadek T. The rise, fall and reinvention of combinatorial chemistry. Chem Commun. 2011;47:9757–9763. doi: 10.1039/c1cc12102b. [DOI] [PubMed] [Google Scholar]
- 5.Falciani C, Lozzi L, Pini A, Bracci L. Bioactive peptides from libraries. Chem Biol. 2005;14(4):417–426. doi: 10.1016/j.chembiol.2005.02.009. [DOI] [PubMed] [Google Scholar]
- 6.Vlieghe P, Lisowski V, Martinez J, Khrestchatisky M. Synthetic therapeutic peptides: science and market. Drug Discov Today. 2010;15(1–2):40–56. doi: 10.1016/j.drudis.2009.10.009. [DOI] [PubMed] [Google Scholar]
- 7.Sato AK, Viswanathan M, Kent RB, Wood CR. Therapeutic peptides: technological advances driving peptides into development. Curr Opin Biotechnol. 2006;17(6):638–642. doi: 10.1016/j.copbio.2006.10.002. [DOI] [PubMed] [Google Scholar]
- 8.Ladner RC, Sato AK, Gorzelany J, De Souza M. Phage display-derived peptides as therapeutic alternatives to antibodies. Drug Discov Today. 2004;9(12):525–529. doi: 10.1016/S1359-6446(04)03104-6. [DOI] [PubMed] [Google Scholar]
- 9.Craik DJ, Fairlie DP, Liras S, Price D. The future of peptide-based drugs. Chem Biol Drug Des. 2013;81(1):136–147. doi: 10.1111/cbdd.12055. [DOI] [PubMed] [Google Scholar]
- 10.Molek P, Strukelj B, Bratkovic T. Peptide phage display as a tool for drug discovery: targeting membrane receptors. Molecules. 2011;16(1):857–887. doi: 10.3390/molecules16010857. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Hamzeh-Mivehroud M, Alizadeh AA, Morris MB, Bret Church W, Dastmalchi S. Phage display as a technology delivering on the promise of peptide drug discovery. Drug Discov Today. 2013;18(23–24):1144–1157. doi: 10.1016/j.drudis.2013.09.001. [DOI] [PubMed] [Google Scholar]
- 12.Passioura T, Suga H. A rapid way to discover nonstandard macrocyclic peptide modulators of drug targets. Chem Commun. 2017;53(12):1931–1940. doi: 10.1039/c6cc06951g. [DOI] [PubMed] [Google Scholar]
- 13.Lam KS, Salmon SE, Hersh EM, Hruby VJ, Kazmierski WM, Knapp RJ. A new type of synthetic peptide library for identifying ligand-binding activity. Nature. 1991;354(6348):82–84. doi: 10.1038/354082a0. [DOI] [PubMed] [Google Scholar]
- 14.Meldal MP. The one-bead two-compound assay for solid phase screening of combinatorial libraries. Biopolymers. 2002;66(2):93–100. doi: 10.1002/bip.10229. [DOI] [PubMed] [Google Scholar]
- 15.Liu R, Li X, Xiao W, Lam KS. Tumor-targeting peptides from combinatorial libraries. Adv Drug Deliv Rev. 2017;110–111:13–37. doi: 10.1016/j.addr.2016.05.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Christensen C, Groth T, Bruun Schiødt C, Tækker Foged N, Meldal M. Automated sorting of beads from a “one-bead-two-compounds” combinatorial library of metalloproteinase inhibitors. QSAR Comb Sci. 2003;22(7):737–744. [Google Scholar]
- 17.Schreiber SL. Organic chemistry: molecular diversity by design. Nature. 2009;457(4226):153–154. doi: 10.1038/457153a. [DOI] [PubMed] [Google Scholar]
- 18.Galloway WRJD, Isidro-Llobet A, Spring DR. Diversity-oriented synthesis as a tool for the discovery of novel biologically active small molecules. Nat Commun. 2010;1(6):1–13. doi: 10.1038/ncomms1081. [DOI] [PubMed] [Google Scholar]
- 19.O’ Connor CJ, Beckmann HSG, Spring DR. Diversity-oriented synthesis: producing chemical tools for dissecting biology. Chem Soc Rev. 2012;41(12):4444. doi: 10.1039/c2cs35023h. [DOI] [PubMed] [Google Scholar]
- 20.Wang Y, Madsen AØ, Diness F, Meldal M. Diversity-oriented syntheses by combining cuaac and stereoselective incic reactions with peptides. Chem Eur J. 2017;23(56):13869–13874. doi: 10.1002/chem.201702900. [DOI] [PubMed] [Google Scholar]
- 21.Wright T, Gillet VJ, Green DV, Pickett SD. Optimizing the size and configuration of combinatorial libraries. J Chem Inform Comput Sci. 2003;43(2):381–390. doi: 10.1021/ci0255836. [DOI] [PubMed] [Google Scholar]
- 22.Zhao PL, Nachbar RB, Bolognese JA, Chapman K. Two new criteria for choosing sample size in combinatorial chemistry. J Med Chem. 1996;39(2):350–352. doi: 10.1021/jm950054x. [DOI] [PubMed] [Google Scholar]
- 23.Guixer B, Arroyo X, Belda I, Sabidó E, Teixidó M, Giralt E. Chemically synthesized peptide libraries as a new source of bbb shuttles. Use of mass spectrometry for peptide identification. J Pept Sci. 2016;22:577–591. doi: 10.1002/psc.2900. [DOI] [PubMed] [Google Scholar]
- 24.Granda JM, Donina L, Dragone V, Long D-L, Cronin L. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature. 2018;559:377–381. doi: 10.1038/s41586-018-0307-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Graulich N, Hopfb H, Schreiner PR. Heuristic thinking makes a chemist smart. Chem Soc Rev. 2010;39:1503–1512. doi: 10.1039/b911536f. [DOI] [PubMed] [Google Scholar]
- 26.Krivák R, Hoksza D. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform. 2018;10(39):1–12. doi: 10.1186/s13321-018-0285-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Trobe M, Burke MD. The molecular industrial revolution: automated synthesis of small molecules. Angew Chem Int Ed. 2018;59(16):4192–4214. doi: 10.1002/anie.201710482. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Gromski PS, Henson AB, Granda JM, Cronin L. How to explore chemical space using algorithms and automation. Nat Rev Chem. 2019;3:119–128. [Google Scholar]
- 29.Belda I, Madurga S, Llorà X, Martinell M, Tarragó T, Piqueras MG, Nicolás E, Giralt E. Enpda: an evolutionary structure-based de novo peptide design algorithm. J Comput Aided Mol Des. 2005;19(8):585–601. doi: 10.1007/s10822-005-9015-1. [DOI] [PubMed] [Google Scholar]
- 30.Belda I, Llorà X, Giralt E. Evolutionary algorithms and de novo peptide design. Soft Comput. 2006;10(4):295–304. [Google Scholar]
- 31.Belda I, Madurga S, Tarragó T, Llorà X, Giralt E. Evolutionary computation and multimodal search: a good combination to tackle molecular diversity in the field of peptide design. Mol Divers. 2007;11(1):7–21. doi: 10.1007/s11030-006-9053-1. [DOI] [PubMed] [Google Scholar]
- 32.Teixidó M, Belda I, Zurita E, Llorá X, Fabre M, Vilaró S, Albericio F, Giralt E. Evolutionary combinatorial chemistry, a novel tool for sar studies on peptide transport across the blood-brain barrier. Part 2. Design, synthesis and evaluation of a first generation of peptides. J Pept Sci. 2005;11(12):789–804. doi: 10.1002/psc.679. [DOI] [PubMed] [Google Scholar]
- 33.Teixidó M, Belda I, Roselló X, González S, Fabre M, Llorá X, Bacardit J, Garrell JM, Vilaró S, Albericio F, et al. Development of a genetic algorithm to design and identify peptides that can cross the blood-brain barrier. QSAR Comb Sci. 2003;22(7):745–753. [Google Scholar]
- 34.Madurga S, Belda I, Llorà X, Giralt E. Design of enhanced agonists through the use of a new virtual screening method: application to peptides that bind class I major histocompatibility complex (mhc) molecules. Protein Sci. 2005;14(8):2069–2079. doi: 10.1110/ps.051351605. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Yoshida M, Hinkley T, Tsuda S, Abul-Haija YM, McBurney RT, Kulikov V, Mathieson JS, Galiñanes Reyes S, Castro MD, Cronin L. Using evolutionary algorithms and machine learning to explore sequence space for the discovery of antimicrobial peptides. Chem. 2018;4:533–543. [Google Scholar]
- 36.Fjell CD, Jenssen H, Cheung WA, Hancock RE, Cherkasov A. Optimization of antibacterial peptides by genetic algorithms and cheminformatics. Chem Biol Drug Des. 2011;77(1):48–56. doi: 10.1111/j.1747-0285.2010.01044.x. [DOI] [PubMed] [Google Scholar]
- 37.Porto WF, Irazazabal L, Alves ES, Ribeiro SM, Matos CO, Pires ÁS, Fensterseifer IC, Miranda VJ, Haney EF, Humblot V, et al. In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat Commun. 2018;9(1):1490. doi: 10.1038/s41467-018-03746-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Neuhaus CS, Gabernet G, Steuer C, Root K, Hiss JA, Zenobi R, Schneider G. Simulated molecular evolution for anticancer peptide design. Angew Chem Int Ed. 2019;58(6):1674–1678. doi: 10.1002/anie.201811215. [DOI] [PubMed] [Google Scholar]
- 39.Gillet VJ, Khatib W, Willett P, Fleming PJ, Green DV. Combinatorial library design using a multiobjective genetic algorithm. J Chem Inform Comput Sci. 2002;42(2):375–385. doi: 10.1021/ci010375j. [DOI] [PubMed] [Google Scholar]
- 40.Meinl T, Ostermann C, Berthold MR. Maximum-score diversity selection for early drug discovery. J Chem Inf Model. 2011;51(2):237–247. doi: 10.1021/ci100426r. [DOI] [PubMed] [Google Scholar]
- 41.Nicolaou CA, Brown N. Multi-objective optimization methods in drug design. Drug Discov Today Technol. 2013;10(3):427–435. doi: 10.1016/j.ddtec.2013.02.001. [DOI] [PubMed] [Google Scholar]
- 42.Mauša G, Galinac Grbac T. Co-evolutionary multi-population genetic programming for classification in software defect prediction: an empirical case study. Appl Soft Comput. 2018;55:331–351. [Google Scholar]
- 43.Coello CAC, Lamont GB, Van Veldhuizen DA. Evolutionary algorithms for solving multi-objective problems. Boston: Springer; 2007. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All the data on which the conclusions of the manuscript rely are thoroughly presented throughout the manuscript. The algorithms needed to reproduce the results are stated in “Materials and methods” section.