This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2026 Feb 15:2026.02.14.705946. [Version 1] doi: 10.64898/2026.02.14.705946

Universal Baseline for in vitro Selection of Genetically Encoded Libraries

Kejia Yan a, Guilherme M Lima a, Tara Bahadur b, Vincent Albert b, Zoe O’Gara b, Gary Bao b, Christin Kossmann c, William Kirby b, Fernando B Mejia d, Matthew L Michnik b, Kristen Maiorana b, Ratmir Derda a,b,*
PMCID: PMC12918800  PMID: 41727091

Abstract

Genetically encoded (GE) libraries enable identification of high-affinity ligands for diverse molecular targets through iterative in vitro selection and DNA sequencing or next-generation sequencing (NGS). Despite their impact in therapeutic development, a systematic framework for evaluating reproducibility in GE-molecular discoveries remains limited. To aid such analysis, we introduce the concept of baseline response, which reproducibly partitions active and inactive members of in vitro selection. The baseline response is provided by spiking a random DNA-barcoded population. We calibrated the baseline concept using Bioconductor EdgeR differential enrichment (DE) analysis of NGS of phage-displayed selection on oligosaccharide chitin and hepatitis virus NS3a* protease as model targets. We further show that mixing discovery campaigns also offers an effective baseline: chitin-enriched peptides serve as a baseline for DE-analysis of NS3a* selection and NS3a*-enriched peptides serve as a baseline for chitin binders. We applied baseline-stratified DE-analysis to 66 parallel selections performed in 3–5 replicates across 22 extracellular targets, including HER1–3, EpCAM, CAIX, PD-L1, and eight integrin receptors. Automated DE-analysis across hundreds of NGS files produced hits validated in a secondary screen and yielded synthetic macrocyclic ligands with mid-nanomolar affinity confirmed in 2–3 biophysical assays. For PD-L1, we further demonstrated how baseline-calibrated NGS data provide decision-enabling information for optimization of peptide macrocycles to yield potent single-digit nanomolar ligands for the cell-surface receptor. We anticipate that baseline-based analyses of NGS data from in vitro selection procedures will offer a scalable framework for reproducible hit discovery and standardized analysis across diverse in vitro selection campaigns.

Keywords: Biological Sciences, Biochemistry, baseline, phage display, diversity, reproducibility

Introduction

In the global therapeutic market, a substantial fraction of approved therapeutics and candidates in clinical trials originate from molecules discovered in genetically encoded selection processes (1, 2). Phage display (3) and related display technologies (ribosome, mRNA, and cell display) (4–7), aptamer selection (a.k.a. SELEX) (8), and selection from DNA-encoded libraries (DEL) (2, 9) all represent the same process of DNA-encoded molecular discovery performed by iterative in vitro selection. This process usually starts from million-to-billion scale diverse libraries of DNA-encoded molecules and aims to enrich a subset of these molecules that bind to a target of interest. Most approaches, with the exception of traditional DEL, employ multiple rounds of in vitro selection and amplification. At each selection cycle, a collection of encoded molecules is incubated with a target displayed on the surface of a microtiter plate well, bead, or cell. This selection is conceptually identical to a traditional binding assay or high-throughput screening (HTS) (10), but fundamental differences exist in the biophysical analysis of these processes. Reproducibility and baseline response are fundamental in the analysis of binding assays; the same analysis of reproducibility over a baseline population gives rise to robust statistical analyses of HTS based on binding assays, such as the Z’ score analysis of HTS assays. The field critically lacks a unified concept of discovery baseline that can be universally employed in any molecular discovery powered by in vitro selection and that would allow an equivalent assessment of reproducibility for in vitro selection of DNA-encoded libraries.
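
For readers less familiar with the Z’ score mentioned above, the following minimal sketch computes the standard Z’ factor of an assay from positive and negative control readouts. The control values are hypothetical and serve only as an illustration; they are not data from this study.

```python
from statistics import mean, stdev

def z_prime(positive, negative):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("positive and negative controls have identical means")
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation

# Hypothetical control readouts in arbitrary units (illustration only).
positive_controls = [98.0, 102.0, 100.0, 101.0, 99.0]
negative_controls = [9.0, 11.0, 10.0, 10.5, 9.5]

score = z_prime(positive_controls, negative_controls)
print(f"Z' = {score:.2f}")
```

A Z’ close to 1 indicates that the control populations are well separated relative to their spread; the negative control plays the role that the baseline population plays in the selections described here.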

To our knowledge, discussions of reproducibility, baseline, and robust statistics derived therefrom are scarce in publications that employ in vitro selections: Krusemark and co-workers employed Z’-type analysis to analyze DEL selection (11), and our lab repurposed Z’ analysis to test the robustness of selection of phage-displayed glycans (12). Phage display literature evaluates convergent selection to the same target-unrelated peptides (13) or the reproducible discovery of parasitic sequences (14). A common approach to assess robustness of selections in the peptide-display and antibody-display literature involves monitoring enrichment of pre-defined binding clones spiked into a naïve library (3), or reproducible emergence of specific motifs, such as streptavidin-binding peptides with the HPQ motif (15), integrin-binding peptides with the RGD motif (16–18), or known antigen sequences (19). Similarly, DEL and SELEX publications often benchmark selection procedures by analyzing emergence of hits for clinically relevant targets (undruggable transcription factors (20), prostate-specific antigens (21), bromodomain (22, 23), carbonic anhydrase (24), thrombin (25), etc.). Several publications explored reproducibility of discoveries as a function of display format (26) or as a function of the size of mRNA-displayed libraries (27), phage-displayed libraries (28), or SELEX libraries (29). In the latter example, Szostak and co-workers found relationships between the complexity of the library, robustness of the selection, and the information content of the evolved aptamers (30). Convergence of parallel evolution of proteins (31), ribozymes (32), and phenotypic outcomes in different species (33) has been reported. A notable example from Liu and co-workers quantified reproducible convergence of selection trajectories in continuous phage-assisted selections of T7 RNA polymerase (34).
These and other analyses rely on observation of specific molecular outcomes, such as reproducible enrichment of defined sequences, molecular structures, or motifs. The recent move toward automated analysis of large NGS datasets emanating from selection procedures, and the use of such datasets for machine learning (ML), requires robust stratification of active and inactive populations in such datasets while limiting the number of false-positive and false-negative hits (35, 36). Anticipated sequence-specific outcomes or predefined target-specific decisions can improve analyses of selection procedures, but an unbiased analysis of in vitro selection benefits from universal, composition-agnostic analysis methods. We strove to devise an analysis of the reproducibility of molecular discoveries that does not rely on any analysis of molecular composition or pre-defined knowledge of molecular structure. For example, the analysis of reproducibility in binding assays is independent of the molecular identity of the binding partners. Instead, binding assays employ the “baseline response” as a foundation for statistical analysis (37, 38). Agglomeration of binding data stratified by baseline is the foundation of all large-scale analysis of binding data (e.g., HTS), of structure-activity relationships (SAR) derived from such data, and of machine learning approaches that learn from these data (39, 40). In this manuscript, we implemented a universal experimental baseline to quantify and normalize molecular discovery outcomes across multiple targets and multiple in vitro selection procedures.

Immunization of animals is an example of a molecular discovery process that employs a universal baseline established nearly half a century ago. In the discovery of antibody A to an antigen X, the baseline response is produced by sampling antibodies from the serum of a non-immunized animal and measuring their binding to antigen X (41). From here, one can extrapolate a baseline response B for any molecular discovery that employs a mixed molecular library [L] and in vitro selection towards target T. Specifically, B is the binding of a collection of molecules randomly sampled from [L] towards target T. By analogy, we demonstrate how to implement such baselines in discoveries that employ a randomized library of peptides [P] displayed on phage. While the rest of the manuscript focuses on phage display and peptide libraries, we note that an identical principle can be used to define a baseline for any in vitro selection procedure that employs large mixed populations of encoded molecules and iterative multi-round selection.

A bona fide random baseline sampled from a naïve molecular library [L] contains both active and inactive components. The distribution of affinities in such random naïve samples was estimated in a hallmark publication by Lancet et al. (42), who measured the shape of the distribution in naïve and selected populations of antibodies by equilibrium dialysis. By definition, in vitro selection eliminates the non-fit members of the library. Hence, after several rounds of in vitro selection, the baseline response no longer exists in such selected populations. In this closed population it is often impossible to establish a robust baseline response because all members of the population exhibit some degree of binding to the desired target. To restore this lost baseline, we spiked a universal DNA-barcoded subset of random peptide sequences into every step of the molecular discovery pipeline. This random baseline population, distinguishable from the remainder of the population through silent codon barcoding (43, 44), delineates a true baseline response from target-specific selections. With this approach, we performed selections across 24 protein targets and observed that the presence of a universal in situ baseline enables quantitative assessment of enrichment fidelity, reproducibility, and cross-sectional analysis of discoveries performed for different targets and different library compositions. By decoupling selection-specific effects from background noise, our baseline strategy establishes a statistical foundation for benchmarking in vitro selections and provides a missing component for reproducibility and data normalization in genetically encoded molecular discovery pipelines.

Results

Spiking a non-binding baseline into pre-enriched library populations

We tested whether a spiked non-targeted (baseline) library provides any additional information in a traditional selection of a phage-displayed library. As a model, we utilized a previously characterized selection (45) against a hepatitis C virus NS3a variant (NS3a*) immobilized on streptavidin magnetic beads (SMB). NS3a*-bound peptides were eluted using grazoprevir (Fig. 1A). In a standard procedure without baseline (Fig. 1AB), hits were identified as peptides significantly enriched in differential enrichment (DE) analysis of NGS data from the third round of selection (R3). The scatter plot of input vs. output NGS reads (Fig. 1C) had a characteristic bimodal distribution: the peptides that pass DE analysis reside in an area clearly separated from the peptides that reside on the diagonal. The latter “on-diagonal” population contains peptides that have similar input and output NGS reads. In addition to selection on the NS3a* target, we employed “control” selection procedures, here SMB without NS3a*, to delineate the peptides enriched due to association with NS3a* from those that might bind to SMB. The scatter plot for this control selection (Fig. 1D) has the same bimodal shape, with clearly separated SMB-enriched “off-diagonal” and non-enriched “on-diagonal” populations. Analysis of Fig. 1CD suggests that the populations in “target” and “control” selections are similar, with one exception: more peptides reside in the off-diagonal (enriched) population in the target selection. Peptides enriched on NS3a* but not SMB contained a recurrent DMT motif, which was previously confirmed to be a hallmark of NS3a*-binding macrocycles.

Figure 1.

Establishing a baseline for evaluating selection quality using a mixed non-binding control library in the final round of selection. A) Procedure of final round selection without baseline. B) Flow chart of panning campaign against NS3a* without baseline. C) Scatter plot of normalized output versus input read counts against target. Peptides enriched in output (green rectangle) relative to input are typically considered selection hits. D) Scatter plot of normalized output versus input read counts against control (streptavidin). E) Flow chart of panning campaign against NS3a* with baseline. F) Procedure of final round selection with a spiked-in non-binding control library (gray). G) Library composition after selection between target and control. After selection on the target, the baseline library is nearly absent (<1%), indicating strong selective enrichment on the target. H-I) Scatter plot of output vs input reads after normalization. Gray dots represent baseline peptides. J-K) Bar plots of peptide enrichment after binning by input abundance. The y-axis shows log (FC+0.001), where FC is fold change (output/input), to allow visualization of peptides with low or zero abundance in output. L-M) histogram of peptide enrichment originated from input reads bin 10–32 (L) and 11–38 (M).

We then repeated the same selection with a baseline control spiked in prior to round 3 of selection (Fig. 1EF). The baseline was a random sample of peptides with an isosteric population of disulfides and an orthogonal silent barcode. Baseline and R3 input were mixed in a 70:30 ratio prior to round 3 of selection (Fig. 1F). Analysis of round 3 after on-target selection showed that the baseline library represented only 0.04% of total reads (Fig. 1G), indicating a 70/0.04 = 1750-fold depletion when compared to peptides pre-selected to bind to NS3a*. In the control selection, the baseline library represented 50% of the reads, a 70/50 = 1.4-fold depletion (Fig. 1G). The scatter plots of the NGS reads highlighted a substantial difference between the “target” and “control” selections. In the “control” selection, the baseline coincided with the on-diagonal population (Fig. 1I). In the selection on NS3a*, baseline peptide sequences were separated from the “diagonal” population by several orders of magnitude (Fig. 1H). The red rectangle highlights R3 and baseline peptides that have the same copy number in the input (~100 counts per million, CPM); after selection on NS3a*, few baseline peptides have CPM=1–3 and most have CPM<1, whereas for all R3 peptides CPM ranges from 20 to 1000. The selection with baseline (Fig. 1H) unveiled an observation missed in the selection without the baseline (Fig. 1C): an average peptide in R3 is a significantly stronger binder to NS3a* than a random population of peptides. The fundamental lesson is that peptides that do not exhibit enrichment in late rounds of multi-round selection cannot be equated to non-binders. This observation is a classical Red Queen effect: the peptides need to interact significantly with NS3a* simply to retain their position in the library. Baseline peptides that do not bind to NS3a* fall behind or are removed from the selection (CPM<1).
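
The depletion factors quoted above follow from the ratio of the baseline share in the input to its share in the output. A minimal sketch of this arithmetic, using the percentages stated in the text (the function and variable names are ours, chosen for illustration):

```python
def fold_depletion(input_percent, output_percent):
    """Fold depletion of a sub-library = share of input / share of output."""
    if output_percent <= 0:
        raise ValueError("output share must be positive to compute a ratio")
    return input_percent / output_percent

BASELINE_INPUT_PERCENT = 70.0  # baseline share of the mixed round 3 input

target_depletion = fold_depletion(BASELINE_INPUT_PERCENT, 0.04)   # NS3a* selection
control_depletion = fold_depletion(BASELINE_INPUT_PERCENT, 50.0)  # SMB-only control

print(f"target: ~{target_depletion:.0f}-fold, control: ~{control_depletion:.1f}-fold")
```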

To quantify whether enrichment is associated with specific sequence motifs, we employed Bioconductor DE analysis (14, 46) to calculate the fold change (FC) between output and input reads for each peptide. Peptides were binned based on their input abundance, and FC was plotted across bins. We separated peptides with the DMT motif (yellow bars) from those without it (blue bars) and from the baseline (grey bars). During selection on NS3a*, both DMT-containing peptides and non-DMT sequences were separated from the baseline. Although DMT-containing peptides exhibited higher median FC values in each bin (Fig. 1J and L), there existed a clear population of non-DMT peptides that enriched several orders of magnitude above the baseline. These observations show that restricting analysis to recognizable peptide motifs (47) may be counterproductive. Indeed, we have previously demonstrated that non-DMT sequences from NS3a* selection bind to NS3a* (45). In the control selection, the FC of DMT-containing peptides and non-DMT sequences was indistinguishable from that of the baseline population, indicating lack of binding to SMB (Fig. 1K and M).
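
The binned comparison described above can be sketched as follows. This is a simplified illustration and not the Bioconductor workflow used in the study: FC is taken as a plain ratio of CPM-normalized reads, the logarithm is assumed to be base 10, and the sequences and read counts are hypothetical toy data. Only the 0.001 offset and the DMT / non-DMT / baseline grouping come from the text and the Figure 1 caption.

```python
import math
from collections import defaultdict
from statistics import median

PSEUDOCOUNT = 0.001  # offset from the Fig. 1 caption: log(FC + 0.001)

def to_cpm(counts):
    """Convert raw read counts to counts per million (CPM)."""
    total = sum(counts.values())
    return {seq: 1e6 * n / total for seq, n in counts.items()}

def label(seq, baseline_seqs):
    """Assign a population label; baseline membership comes from the barcode."""
    if seq in baseline_seqs:
        return "baseline"
    return "DMT" if "DMT" in seq else "non-DMT"

def binned_log_fc(input_counts, output_counts, baseline_seqs, bins):
    """Median log10(FC + PSEUDOCOUNT) for each (input-read bin, population)."""
    cpm_in, cpm_out = to_cpm(input_counts), to_cpm(output_counts)
    groups = defaultdict(list)
    for seq, reads in input_counts.items():
        if reads == 0:
            continue  # FC is undefined without input reads
        fc = cpm_out.get(seq, 0.0) / cpm_in[seq]
        for lo, hi in bins:
            if lo <= reads <= hi:
                key = ((lo, hi), label(seq, baseline_seqs))
                groups[key].append(math.log10(fc + PSEUDOCOUNT))
                break
    return {key: median(values) for key, values in groups.items()}

# Hypothetical toy data (made-up sequences and read counts).
input_counts = {"ACDMTWC": 20, "ACPLWQC": 15, "ACGGSGC": 25, "ACTTSAC": 12}
output_counts = {"ACDMTWC": 400, "ACPLWQC": 90, "ACTTSAC": 1}
baseline = {"ACGGSGC", "ACTTSAC"}

summary = binned_log_fc(input_counts, output_counts, baseline, bins=[(10, 32)])
for key, value in sorted(summary.items()):
    print(key, round(value, 2))
```

Comparing populations within the same input-abundance bin is what keeps the comparison fair: peptides with very different input read counts have different sampling noise and different attainable FC ranges.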

To demonstrate the use of the baseline in evaluating the robustness of a selection, we compared NS3a* selection from disulfide libraries to the same selection with libraries chemically modified by metabromoxylene (MBX). Spiking a baseline population at the third round of selection revealed that the MBX-modified library exhibited an inferior selection outcome. In R3 of this selection, the enrichment factors of >90% of the population of MBX-modified macrocycles were indistinguishable from the enrichment of the random baseline population (Fig. S1D). We compared these data to the enrichment factors of peptides of the same abundances in R3 of the selection on NS3a* that employed unmodified disulfide macrocycles: 99% of the population was clearly separated from the baseline population (Fig. S1C). The most abundant peptides at R3 in both selections contained DMT motifs (Fig. S1CD); hence, a motif-based heuristic assessment of these discovery campaigns would deem both of them a “success”. Nevertheless, calibration of these selections against the same baseline uncovered a significant decrease in the enrichment of the MBX-modified library, because only a minor subset of sequences in R3 exhibited enrichment above the “baseline response”. This suboptimal performance might result from the known ability of certain cysteine-reactive crosslinkers such as MBX to decrease the infectivity of phage (48); it is possible that compromised infectivity limits the efficiency of multi-round selection.

Forging baseline in selection by mixing selections from unrelated targets

A binned analysis in the previous section (Fig. 1JK) illustrates that a robust baseline should not only contain a random population of peptides but also exhibit peptide abundances (measured as counts per million (CPM)) that resemble those of the preselected populations. Such a CPM-matched random baseline is not always readily available. A sample of the naïve library does not form a robust baseline because all peptides in such a population have low CPM. We hypothesized that, in selections on unrelated targets T1 and T2, the selected population S1 can serve as a baseline in the selection for target T2 and vice versa. A peptide population selected on an unrelated target is likely to contain peptides with high and low CPM and, in principle, can serve as a functional baseline. Blending of unrelated selections is counterintuitive: why contaminate the library with unrelated binders after it has been carefully selected to enrich only useful sequences? Nevertheless, the “in silico mixing” of hits from unrelated campaigns is an accepted practice: for example, a recent preprint from Dickinson and co-workers trained contrastive learning models by combining data from binding campaigns on protein P1 (the “binder” population) with data from separate binding campaigns on an unrelated protein P2; peptides enriched in the P2 but not the P1 campaign were denoted as the “non-binder” population (35).

To test the benefits of physically mixing unrelated libraries, we conducted a separate phage display campaign starting from the SX2CX8CX2 library to identify peptides that bind the polysaccharide chitin (Fig. 2A). This selection employed chitin beads and traditional acid elution. In short, the selection was successful because the R3 library clearly separated from the baseline population (Fig. 2B). A dominant HPV motif emerged among the top enriched peptides (Fig. S2A), and a synthetic biotinylated chitin-binding peptide isolated from the campaign bound to chitin magnetic beads as measured by fluorescence (Fig. S2BC) and to chitin-like glycans but not to unrelated glycans in a glycan array (Fig. S3). Given the distinct molecular characteristics of chitin (a polysaccharide) and NS3a* (a protein), we postulated that libraries selected against these targets may serve as mutual baselines. To test the cross-target baseline concept, we blended the SX2CX8CX2 library from the R3 input of the chitin selection and the SX3CX9C library from the R3 input of the NS3a* selection and further spiked in the bona fide random baseline used above. The combined library was used as input for round 3 selections against either chitin or NS3a* (Fig. 2B). In selections against NS3a*, the average fold change of peptides from the chitin-selected library closely matched that of the random baseline across all input bins (Fig. 2B), indicating that chitin-binding peptides do not exhibit specific binding to NS3a* and function effectively as a baseline. In the reciprocal selection against chitin, the SX3CX9C library (pre-selected for NS3a*) was indistinguishable from the bona fide baseline (Fig. 2C), whereas the chitin-selected population contained peptides that were clearly separated from both baseline populations. These results justify the blending of libraries derived from unrelated selection campaigns as a practical implementation of baselines.
Encouraged by these results, we extended the cross-target baseline concept to campaigns with 22 protein targets.

Figure 2.

Using a library selected for a different target as baseline. A) Procedure of selection on chitin (a polysaccharide) with baseline. B) Procedure of selection on structurally different targets: NS3a* (a hepatitis C viral protease) and chitin. In each case, in round 3, the target-specific library was mixed with two controls: a non-binding baseline and a library previously selected on a different target. Selections were then performed separately on NS3a* and chitin. C) Scatter plot of normalized output versus input read counts against chitin. D) Histogram of peptide enrichment originating from input reads bin 17–65. E-F) Bar plots showing enrichment after binning by input abundance. In the NS3a* selection (E), the NS3a*-enriched library displayed strong divergence from both the non-binding baseline and the chitin-selected library, indicating target-specific enrichment. In contrast, during selection on chitin (F), the three libraries (chitin-selected, NS3a*-selected, and baseline) showed similar behavior, suggesting minimal selective enrichment and poor discrimination between peptide populations. G) Histogram of peptide enrichment originating from input reads bin 25–96. Note that in the highest frequency bin, peptides from the bona fide baseline population were missing, but NS3a*-selected peptides offered a convenient baseline response.

Use of baseline in semi-automated analysis of parallel selection campaigns

The presence of a constant baseline in the molecular discovery procedure makes it possible to standardize the analysis of data emanating from multiple selections. As an illustration, we performed 66 parallel selections using 22 protein targets and 3 phage-displayed libraries (Fig. 3A–D). Each protein was expressed by a commercial supplier as a fusion with a His-Avi tag for oriented immobilization on streptavidin or Ni-NTA. Most cell-surface proteins in this campaign are targets of interest for the development of peptide-based radiopharmaceuticals (49). In the first round of selection (R1), targets were immobilized on the surface of a streptavidin-coated 96-well plate and incubated with the phage-displayed libraries (Fig. 3A). After washing, elution, and amplification, the libraries of different architectures selected on the same target were pooled and used as input for the second round of selection (R2, Fig. 3AB). R2 employed protein immobilized on Ni-NTA beads (Fig. 3B) and an automated bead-washing, elution, and amplification workflow to produce the R3 input for each target. Instead of using these R3 inputs individually, we combined R3 inputs from multiple selections. These combined inputs were employed in the third round of selection (Fig. 3BC). The NGS data from the R3 input and from the output of panning on the target and control (streptavidin beads without target) were processed using a DE algorithm that identified a population of peptides enriched above the baseline population in the target but not the control selection (Fig. 3D). As we observed in previous sections, mixing R3 inputs emanating from selection campaigns on distinct targets forms an effective baseline; however, the prior experiment carefully delineated library architectures and did not exhaustively test the limits of such mixing. In this selection, we mixed R3 inputs from 22 targets in batches of N=4–12 campaigns to test whether such mixing can yield a productive baseline (Fig. 3E).
Automation of the DE analysis was one of the benefits of mixing the inputs from as many targets as possible: in selections from the mixed population, the DE analysis reuses the data from the input library and control selections. We validated the hits in two tiers: 16,000 peptides were tested in a focused library (Tier 1), and the binding properties of 27 synthetic peptides were measured using surface plasmon resonance and ELISA (Tier 2).
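
The hit-nomination rule described above (enriched above the baseline in the target selection but not in the control selection) can be illustrated with a minimal sketch. Note the assumptions: the study used a DE algorithm for this step, whereas the sketch swaps in a simple empirical cutoff at a percentile of the baseline log FC distribution; the function names, the 99th-percentile default, and all peptide identifiers and values are hypothetical.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def call_hits(log_fc_target, log_fc_control, baseline_ids, q=99.0):
    """Peptides above the baseline cutoff on target but not on control."""
    cut_target = percentile([log_fc_target[s] for s in baseline_ids], q)
    cut_control = percentile([log_fc_control[s] for s in baseline_ids], q)
    return sorted(
        s for s in log_fc_target
        if s not in baseline_ids
        and log_fc_target[s] > cut_target
        and log_fc_control.get(s, float("-inf")) <= cut_control
    )

# Hypothetical log10 FC values: b1-b4 are barcoded baseline peptides.
baseline_ids = {"b1", "b2", "b3", "b4"}
log_fc_target = {"b1": -2.5, "b2": -3.0, "b3": -1.9, "b4": -2.2,
                 "p1": 1.2, "p2": 0.8, "p3": -2.4}
log_fc_control = {"b1": 0.0, "b2": -0.1, "b3": 0.1, "b4": 0.05,
                  "p1": -0.05, "p2": 1.5, "p3": 0.0}

hits = call_hits(log_fc_target, log_fc_control, baseline_ids)
print(hits)
```

In this toy example, p2 is excluded because it is also enriched in the control selection, and p3 is excluded because it behaves like the baseline on the target. Because the cutoffs are derived from the shared baseline, the same function can be applied unchanged to every campaign in a mixed pool.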

Figure 3.

Application of baseline-based detection in parallel panning against multiple targets. A) Example of workflow showing selection of three different libraries against 2 targets. B) Selections for the same target originating from different libraries are mixed at round 2. C) At round 3, multiple selection campaigns are mixed and used as input for round 3 for multiple targets and control selections. D) Example of differential enrichment (DE) analysis that identifies baseline sequences and sequences enriched above the baseline in panning against a specific target. In this workflow, multiple selection campaigns share the same baseline. E) Mixing of multiple campaigns and a universal baseline makes it possible to streamline analysis of 3 phage libraries against 22 targets (66 independent selections progressively mixed together). F) The same DE analysis algorithm can be iteratively applied to all selection campaigns to nominate the hit compounds (16,000 total), which were validated by testing them all in parallel in a focused library. G) Panning of the same focused library against the targets and controls, followed by DE analysis of NGS data, identified the enrichment profile of the focused library in each selection, presented as a Manhattan plot (H) or as a heatmap (I). For example, selection against T6 (EpCAM) or T9 (PD-L1) enriches a subset of the 1544 peptides nominated from the T6 or T9 selection campaign. In contrast, peptides from integrin selection campaigns show considerable cross-reactivity.

In Tier 1 validation, we used 16,000 peptides identified by DE from 22 selections against targets T1-T22. These peptides were converted to a DNA library and cloned into a phage-displayed focused library [F], as described in our previous report (46). Binding of [F] to the original 22 targets (Fig. 3G), followed by NGS and DE analysis, identified peptides that exhibited a significant fold change (FC) enrichment when compared to the input and control selections (Fig. 3E). We observed three types of outcomes. For targets like EpCAM (T6) and PD-L1 (T9), selections can be marked as successful. A significant fraction of the peptides nominated to bind to these targets (i.e., [T6O] and [T9O]) were indeed confirmed to bind to these and only these targets (Fig. 3H). Of the 1544 peptides in [T6O], i.e., those discovered in panning on T6 (EpCAM), 500 bind to EpCAM above the baseline but not to PD-L1, ROR1, or integrin αVβ6. Similarly, peptides from the [T9O] population bind to PD-L1 but not to EpCAM, ROR1, or integrin. In Fig. 3I, a heatmap summary of 22 Manhattan plots describing 22 focused library experiments, it is evident that peptides selected in the T6 or T9 campaigns that bind to T6 or T9 do not bind to any of the other 21 targets. For ROR1, of the 429 peptides nominated to bind to ROR1, only 2 exhibited binding to ROR1 above the “baseline” (Fig. 3H). Finally, for integrin targets, we observed the expected poly-specific binding. Binding of [F] to αVβ6 identified binders not only from [T21O] but also from [T16O]-[T20O] and [T22O]. Of the 138 peptides in the [T21O] population, many also bind to T17-T20 and T22 (red box, Fig. 3I). In general, selections for the T17-T22 targets often identified peptides that bound to the cognate target and to other targets in this cluster, but not to the other proteins T1-T16. Selection for T16 (α4β7) identified peptides that cross-reacted with T15 (α4β1) but not with the T17-T22 integrins.
Selection for T15 (α4β1) identified many peptides that cross-reacted with T18 (αVβ1) and a few peptides that cross-reacted with T16 (α4β7), T19 (αVβ3), T21 (αVβ6), and T22 (αVβ8).

In Tier 2 validation, we synthesized 27 peptides from 5 selection campaigns with C-terminal biotin and validated their interaction with the immobilized target using ELISA with dose-response titration of soluble biotinylated peptide to estimate the EC50 of binding. We employed at least two other assays. When limited solubility of biotinylated peptides or high non-specific binding (nsb) prevented ELISA measurements (e.g., PSMA or EpCAM, Fig. 3J), we employed a biolayer interferometry (BLI) assay to estimate Kd using immobilized biotinylated peptide and soluble protein. For CAIX, we re-synthesized some peptides without C-terminal biotin and re-tested them by surface plasmon resonance (SPR) with immobilized CAIX to yield kinetic Kd values (Fig. 3J). Overall, the Tier 2 validation confirmed that most peptides that exhibit significant FC values in the focused library exhibit mid-nanomolar to single-digit micromolar potency in the binding assays.

In the parallel selection against 22 targets, we did not employ any cross-target depletions. Emergence of poly-specific hits in such selections is an anticipated outcome and has been well documented in phage display (50) and even in continuous selection procedures (51). The selection against 22 targets covered an array of possibilities, from unrelated targets to weakly related HER1-HER3 (T3-T5) and VEGFR1–3 (T12-T14), integrins with weak homology (T15 vs. T17-T22), close homologues (T17-T22), and extreme homologues such as T18 and T22 that differ in only a few amino acids in the vicinity of the peptide binding site. We anticipated that, by mixing selections from unrelated targets such as T1 (carbonic anhydrase 9 or CAIX), T2 (c-Met), T3 (EGFR or HER1), etc., the peptides selected to bind to c-Met and EGFR would form a “baseline” relative to CAIX because these three targets share no homology. A c-Met-binding or EGFR-binding peptide is unlikely to bind to CAIX. Indeed, these selections yielded peptides that were confirmed to bind to these and only these targets in the focused library selection (Fig. 3I). In the selections against homologous integrin targets (T17-T22), selection of cross-reactive hits (Fig. 3I) was not surprising. For example, peptides selected against α4β7 often exhibit cross-reactivity to α4β1. Phage display selections against integrins αVβ3, αVβ5, and αVβ6 have been documented (52–54). It is known that selections on these integrins frequently give rise to peptides that bind to homologous integrins and that selection of isoform-specific peptides requires intricate depletion against cross-reactive targets (16, 55). Beyond cross-reactivity, the selection procedure successfully identified 60–80% of peptides that bound to the cognate integrin but not to the T1-T16 targets outside the integrin family. Cross-reactivity with other integrins is unlikely to be related to mixing of libraries or use of the baseline in DE analysis.
Selection and validation of integrin-specific peptides that avoid binding to closely related homologues are beyond the scope of this publication.

We tested whether baseline-stratified FC values can offer decision-enabling information for the optimization of the properties of sequences. We focused this question on sequences discovered to bind to PD-L1 protein. The focused library F contained 12×18=216 sequences derived from peptide SDFCSWVPEAFWCQE in which every underlined position was substituted by 18 other amino acids (all except Cys and the parent amino acid). The heatmap in Fig. 4A describes the FC enrichment of each derivative sequence in the panning experiment. We noted that changes in certain positions gave rise to a baseline-level FC (Fig. 4A). This loss of function indicates that a specific amino acid is indispensable for binding. In other positions, several substitutions could be regarded as “neutral changes” or improvements (Fig. 4A). The analysis of absolute or relative FC values (Fig. S4) suggested at least six positions for optimization; a separate deep mutational scan (Fig. S5) suggested further modifications of the N and C termini of SDFCSWVPEAFWCQE. These plausible positional changes (Fig. 4B), when applied individually or in combination to SDFCSWVPEAFWCQE, gave rise to a series of sequences that exhibited single-digit nanomolar potency in cell binding and SPR assays (Fig. 4C, Fig. S6–S12). We note that an improvement of 3–4 orders of magnitude in binding performance (Fig. 4C) was possible without engagement of non-canonical amino acids (ncAA). The only exception was the use of 4R-fluoroproline, denoted as π in Fig. 4C. We noted that 4R-fluoroproline provided a 2–3-fold benefit in binding, whereas 4S-fluoroproline in the same position significantly diminished the binding (not shown). These improvements likely result from the stereoelectronic role of fluorine in attenuating the equilibria between exo/endo ring pucker and trans/cis conformation of the amide bond in proline (56). The properties of the identified sequences with single-digit nanomolar performance in cell binding assays were further improved by ncAA mutagenesis, but the discussion of these improvements is beyond the scope of this manuscript and will be presented elsewhere.
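The library size of 12×18=216 follows from substituting each scanned position with every amino acid except Cys and the parent residue. The sketch below enumerates such a library. It is an illustration only: the function name is hypothetical, and because the identity of the 12 scanned positions does not survive in this text, the first 12 non-Cys positions are used purely as a placeholder.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 proteinogenic amino acids
PARENT = "SDFCSWVPEAFWCQE"            # parent sequence quoted in the text


def single_substitutions(parent, positions, excluded="C"):
    """Return all variants carrying one substitution at the given 0-based positions.

    At each position the parent residue and any residue in `excluded`
    are skipped, which leaves 18 substitutions when Cys is excluded.
    """
    variants = []
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa == parent[pos] or aa in excluded:
                continue
            variants.append(parent[:pos] + aa + parent[pos + 1:])
    return variants


# ASSUMPTION: the actual 12 scanned positions are not recoverable here,
# so the first 12 non-Cys positions serve as a placeholder.
non_cys_positions = [i for i, aa in enumerate(PARENT) if aa != "C"]
scanned_positions = non_cys_positions[:12]

library = single_substitutions(PARENT, scanned_positions)
print(len(library))
```

Any other choice of 12 non-Cys positions gives the same library size, since each position contributes 18 variants.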

Figure 4.

Application of baseline-based measurements of enrichment as decision-enabling information in optimization. A) Heatmap describing changes in binding performance of phage-displayed peptides, as measured by FC, caused by systematic changes of every amino acid to every other proteinogenic amino acid (aka “deep mutational scan”). B) Structure of the core region with key optimizable positions identified by the deep mutational scan. C) Synthetic peptides with individual changes or combinations of changes suggested by the deep mutational scan. Properties of the peptides were assessed in three independent assays. The ELISA assay employed a dose-response titration of soluble biotinylated peptide and PD-L1 immobilized on a microtiter well plate. SPR employed surface-immobilized PD-L1 and soluble peptides; the cell-based assay employed a dose-response titration of biotinylated peptide with PD-L1(+) CHO cells, where the binding of the peptides was probed by fluorescent streptavidin.

Discussion

The advent of next-generation sequencing (NGS) and its introduction to phage display and other selection procedures around 2010 encouraged many research groups to postulate that a “deep” analysis of in vitro selections by NGS holds promise to identify hit sequences that are missed in traditional analyses based on Sanger sequencing of isolated clonal populations. Delivery on this promise requires a unified framework for the mathematical analysis of the NGS data emanating from multi-round in vitro selection. Our observation is that such a unified framework is still lacking. Some semi-quantitative approaches combine analysis of NGS frequency with analysis of peptide motifs (45, 57); algorithms that search for motifs in NGS data in a supervised or unsupervised fashion have been reported (58, 59); most algorithms have not been adopted across the community. Another approach simply states that the most abundant peptides in the NGS data should be prioritized for validation. Other arguments prioritize sequences that change their enrichment in two adjacent rounds (60). The latter could be tainted by the amplification bias in multi-round selections (14). The Bioconductor DE pipeline, uniformly accepted in the RNASeq community, offers unbiased statistics-based analysis of sequences significantly enriched between the test and a plurality of control populations. This pipeline can automatically process hundreds of datasets in parallel. Our lab repurposed the Bioconductor DE pipeline to analyze amplification bias in phage libraries and to identify enrichment in phage display (61), in DNA-barcoded populations of peptide derivatives (44), and in glycans and other molecules that fail in clustering approaches (62–65). Unfortunately, a minor yet fundamental difference exists between RNASeq and in vitro selection datasets. All RNASeq datasets contain a bona fide baseline formed by non-responder transcripts. Such a baseline is often absent from in vitro selection procedures.

In our lab, we colloquially refer to this as the “swimmer paradox”: DE analysis can readily delineate fast and slow swimmers, but it fails in the analysis of non-swimmers and in the analysis of a population of Olympic swimmers if they finish the lap with nearly matched times. Without an absolute reference (e.g., chronometry), DE cannot distinguish a uniform population of Olympic swimmers from an equally uniform population of individuals whose swimming ability is zero. Similarly, in a population of strongly binding peptides, DE-analysis fails if all peptides exhibit similar strength of binding. Failed DE in local analysis cannot distinguish between specific and non-specific outcomes, and we have documented failed DE-analysis in successful selection procedures (46). Introducing a baseline into all in vitro selections makes DE-analysis robust because the binding performance of all sequences is stratified against an absolute non-binding reference. Mixing libraries from two selection procedures breaks the swimmer paradox because it places Olympic swimmers and non-swimmers in the same pool. Our manuscript demonstrates that introduction of a baseline enables robust identification of true binders as those that outperform randomly distributed or irrelevant peptide sequences. A baseline allows robust stratification of the selection procedures and automated analysis of a large number of selection procedures in parallel.
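The swimmer paradox can be reproduced with a toy calculation. The sketch below is not the EdgeR pipeline: it uses simple counts-per-million scaling in place of TMM normalization, the counts and sequence names are hypothetical, and referencing to the median baseline sequence is an assumption made for illustration. It shows that a uniformly enriched population appears unchanged after normalization and that spiked non-binding sequences restore the signal.

```python
from statistics import median


def cpm(counts):
    """Scale raw read counts to counts per million (library-size normalization)."""
    total = sum(counts.values())
    return {seq: 1e6 * n / total for seq, n in counts.items()}


def fold_change(output_counts, input_counts):
    """Ratio of normalized abundance after selection to that before selection."""
    out_cpm, in_cpm = cpm(output_counts), cpm(input_counts)
    return {seq: out_cpm[seq] / in_cpm[seq] for seq in output_counts}


# Hypothetical counts: three binders, each retained 9-fold better than a non-binder.
binders_in = {"pepA": 100, "pepB": 100, "pepC": 100}
binders_out = {"pepA": 900, "pepB": 900, "pepC": 900}

# Without a baseline, normalization cancels the uniform enrichment.
fc_alone = fold_change(binders_out, binders_in)

# Same selection with three spiked non-binding baseline sequences.
baseline_in = {"base1": 100, "base2": 100, "base3": 100}
baseline_out = {"base1": 100, "base2": 100, "base3": 100}
fc_mixed = fold_change({**binders_out, **baseline_out},
                       {**binders_in, **baseline_in})

# Express every fold change relative to the median baseline sequence.
reference = median(fc_mixed[seq] for seq in baseline_in)
fc_vs_baseline = {seq: fc / reference for seq, fc in fc_mixed.items()}

for seq in binders_in:
    print(seq, round(fc_alone[seq], 2), round(fc_vs_baseline[seq], 2))
```

Analyzed alone, each binder has a fold change of 1 and is indistinguishable from a population of non-binders; relative to the baseline, the same binders show the 9-fold enrichment built into the toy counts.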

Materials and Methods

General biochemistry information

The libraries were constructed using TriNuc codons that comprise only 19 specific codons, excluding cysteine (66). The procedures were adapted, with modifications, from publications that produced the M13-displayed SXCXXXC library (61), the M13-SDB vector (67), and the SXCX4C and SXCX5C libraries (43). In short, to produce the SX3CX9C library, the vector SB4 QFT*LHQ was digested with KpnI HF (NEB Cat# R3142S) and EagI HF (NEB Cat# R3505S). A primer/template pair was prepared from annealed primer 5′-AT GGC GCC CGG CCG AAC CTC CAC C-3′ and template 5′-CC CGG GTA CCT TTC TAT TCT CAC TCT TCT XXX TGT XXXXXXXXX TGT GGT GGA GGT TCG GCC GGG CGC TTG ATT-3′, with each ‘X’ representing a trinucleotide cassette (TriNuc) that contains a mixture of 19 codons encoding the 19 natural amino acids except cysteine (oligos containing the TriNuc mixtures were synthesized by Genscript). The primer/template was then extended using Klenow DNA polymerase (NEB) according to the manufacturer’s instructions. The insert fragment was then digested with KpnI HF and EagI HF, gel purified and ligated into the cut vector. The ligation products were then transformed into electrocompetent E. coli cells, and the transformants were grown overnight on E. coli TG1 to allow for phage production. Phage cultures were then centrifuged to remove cells and debris, and the phage was precipitated with PEG (5% PEG, 0.5 M NaCl).

In our previous reports, we characterized the diversity of small libraries fully by NGS (61, 68). The true diversity of a 10⁹-scale library is difficult to estimate because it would require prohibitively expensive NGS of >10¹⁰ reads (10× coverage of diversity), e.g., 400 runs with the MiniSeq High Output Reagent Kit at 25 million reads each, costing 400×$3000 = $1,200,000.

Based on our previous reports, diversity can be inferred from NGS analysis of a sample of the naïve library; for example, a sample of the SX3CX9C library comprising 2,309,529 reads contains 2,268,469 unique sequences.
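The arithmetic behind the cost estimate and the sampled diversity can be checked with a few lines. This is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative.

```python
import math

library_diversity = 10**9      # order of magnitude of the library, as stated
coverage = 10                  # 10x coverage of diversity
reads_per_run = 25_000_000     # reads per MiniSeq High Output run, as stated
cost_per_run_usd = 3000        # per-run cost used in the text

reads_needed = library_diversity * coverage
runs_needed = math.ceil(reads_needed / reads_per_run)
total_cost_usd = runs_needed * cost_per_run_usd

# Sample of the naive SX3CX9C library quoted in the text
total_reads = 2_309_529
unique_sequences = 2_268_469
unique_fraction = unique_sequences / total_reads

print(runs_needed, total_cost_usd)
print(f"{unique_fraction:.1%} of sampled reads are unique sequences")
```

Roughly 98% of the sampled reads are unique, which is the expected behavior when a sample of about 2 million reads is drawn from a far larger pool of distinct sequences.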

The SX3CX9C, SX2CX8CX2 and SDB 17/SDB10 GS23 libraries were obtained from 48HD. The sequencing files for naïve libraries were uploaded to https://48hd.cloud/ and the links are attached below.

SX3CX9C: https://48hd.cloud/file/11581

SX2CX8CX2: https://48hd.cloud/file/11471

GS23: https://48hd.cloud/file/18801

Panning procedures on NS3a*

The panning procedure on NS3a* was the same as described in a previous publication (45).

Panning procedure on chitin

Panning was performed in 3 rounds. In all rounds, chitin magnetic beads (New England Biolabs, #E8036S) were used as the target. Blocking, binding and washing steps were done in a KingFisher Duo Prime Purification System (Thermo Scientific, #5400110). Chitin magnetic beads were vortexed for 2 seconds and the tube was placed on the magnet to remove the storage buffer. The beads were washed three times with binding buffer (20 mM Tris HCl, 500 mM NaCl, 1 mM EDTA disodium, pH 8.0) and resuspended in 100 μL binding buffer.

For round 1, the bead suspension and other reagents were added to a 96 Deepwell Plate (Thermo Fisher, #95040450) as follows:

Row A: Chitin magnetic beads (New England Biolabs, #E8036S) (0.1 mL/well)

Row B: reserved for 12-tip Deepwell magnetic comb (Thermo Fisher, #97003500)

Row C: Binding Buffer (20 mM Tris HCl, 500 mM NaCl, 1 mM EDTA disodium, pH 8.0)

Row D: Blocking Buffer (1 mL, 2% BSA (w/v) in Binding buffer)

Row E: Solution of S2C8C2 phage library (100 μL, 10¹² PFU/mL)

Row F: Wash Buffer (1 mL, 0.1 % Tween in Binding buffer)

Row G: Wash Buffer (1 mL, 0.1 % Tween in Binding buffer)

Row H: Binding Buffer (1 mL, 0.1 % Tween in Binding buffer)

The following steps were performed using a KingFisher™ Duo Prime Purification System with a magnetic comb to transfer the beads. The program is as follows:

  1. Collect comb from row B

  2. Collect beads from row A on comb

  3. Wash beads in row C – 30 s

  4. Block in row D – 1 h

  5. Phage binding in row E – 1.5 h

  6. Wash beads in row F – 1 min

  7. Wash beads in row G – 1 min

  8. Wash beads in row H – 1 min.
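The program above can be written down as a simple data structure to check the total hands-off time and the final bead location. This is only a sketch: the tuple layout is an assumption, it is not the instrument's program format, and the transfer time between rows is ignored.

```python
# Timed steps of the bead-handling program: (action, plate row, minutes).
# Steps 1 and 2 (collecting the comb and the beads) have no stated duration
# and are therefore omitted.
PROGRAM = [
    ("Wash beads", "C", 0.5),
    ("Block", "D", 60),
    ("Phage binding", "E", 90),
    ("Wash beads", "F", 1),
    ("Wash beads", "G", 1),
    ("Wash beads", "H", 1),
]

total_min = sum(minutes for _, _, minutes in PROGRAM)
final_row = PROGRAM[-1][1]

print(f"{total_min} min of timed steps; beads end in row {final_row}")
```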

At the end of the program, the protein-coated beads with bound phage were in the wells of the last row (row H). The content of each well from row H was transferred to an individual Eppendorf tube and the tubes were placed in a Dynabead MPC-S for 30 seconds to capture the beads. The supernatant was discarded, the beads were resuspended in elution buffer (200 μL, 200 mM glycine pH 2.2) and rotated on a Thermo Scientific Labquake 360 rotator (cat# C400220Q) for 9 min, and then 30 μL of neutralization buffer (1 M Tris HCl pH 9.1) was added immediately. The eluted phage was taken for further processing. 200 μL of eluted phage was used for phage amplification (see Phage amplification).

For round 2, the same protocol was followed, except that we used 30 μL of chitin magnetic beads instead of 50 μL and three washes were performed in 1 mL of 0.1 % Tween in Binding buffer instead of two washes. For round 3, we included the baseline library, used 30 μL of chitin magnetic beads and performed 6 washes in 1 mL of 0.1 % Tween in Binding buffer.

Chemical modification of phage libraries

Macrocyclization with MBX: a solution of SX3CX9C phage library (10 μL, ~10¹³ PFU/mL in PBS with 50% glycerol, pH 7.4, amplified output of Round 2) was first diluted with 62 μL of water in a 1.7 mL Eppendorf tube, and 10 μL of 1 M Tris-base buffer (pH 8.6) was added to adjust the pH of the reaction mixture to 8.6 (checked by universal pH paper). TCEP (2 μL, 100 mM in water) was added to the reaction, mixed by vortexing and incubated at room temperature for 30 min. The reaction was loaded on an equilibrated Zeba Spin column (Thermo Fisher cat# 89883) to remove excess TCEP. 10 μL of 1 M Tris-base buffer (pH 8.6) was added, and the macrocyclization reaction was initiated by the addition of MBX linker (1 μL, 10 mM in DMF). The reaction mixture was mixed and incubated at room temperature for 20 min. After 20 min, the reaction mixture was loaded on an equilibrated Zeba Spin column and eluted by centrifugation at 2,000 × g. The macrocyclized phage library was obtained as a clear, colorless solution.

Phage amplification

70 μL of eluted phage was mixed with 250 μL of E2773 bacterial culture and then added to 25 mL of LB media for amplification at 37 °C for 4.5 h. After amplification, the culture was transferred to a 25 mL Falcon tube and centrifuged for 10 min at 6,000 × g at 4 °C. The supernatant was transferred to a new 25 mL Falcon tube and 1/5 volume of PEG/NaCl (~5 mL) was added. The phage was incubated at 4 °C overnight. The next day, the phage was centrifuged at 14,000 × g for 30 min at 4 °C and the pellet was resuspended in 1 mL PBS. The suspension was transferred to a 1.7 mL microcentrifuge tube and centrifuged at maximum speed for 5 min at 4 °C. The supernatant was transferred to a new 1.7 mL microcentrifuge tube, 1/5 volume of PEG/NaCl (~200 μL) was added, and the mixture was incubated for 1 h on ice. The suspension was centrifuged at maximum speed for 30 min at 4 °C, and the pellet was resuspended in 500 μL of PBS and stored at 4 °C.

PCR of phage

Two-step semi-nested PCR was used to improve the sensitivity. The method was adapted from our previously developed protocol (69, 70). DNA template (phage solution) was amplified in Phusion® HF buffer with Phusion® High-Fidelity DNA Polymerase (NEB, #M0530S).

1st step:

A typical 50 μL reaction mixture contained:

  1. 5x Phusion buffer 10 μL

  2. 10 mM dNTPs 1 μL

  3. DMSO 2.5 μL

  4. Phusion® Polymerase 0.5 μL

  5. Forward primer (TTTTGGAGATTTTCAACGTG, 10 μM) 1 μL

  6. Reverse primer (CCCTCATAGTTAGCGTAACG, 10 μM) 1 μL

  7. DNA Template solution 10 μL

  8. Nuclease free water 24 μL

Cycling was performed using the following thermocycler settings:

  1. 98 °C 3 min

  2. 98 °C for 10 s

  3. 50 °C 20 s

  4. 72 °C 20 s

  5. repeat steps 2–4 for 30 cycles

  6. 12 °C 1 min

  7. 4 °C hold

2nd step:

A typical 50 μL reaction mixture contained:

  1. 5x Phusion buffer 10 μL

  2. 10 mM dNTPs 1 μL

  3. DMSO 2.5 μL

  4. Phusion® Polymerase 0.5 μL

  5. Forward primer (10 μM) 1 μL

  6. Reverse primer (10 μM) 1 μL

  7. PCR product from 1st step 2 μL

  8. Nuclease free water 32 μL

Cycling was performed using the following thermocycler settings:

  1. 98 °C 3 min

  2. 98 °C for 10 s

  3. 50 °C 20 s

  4. 72 °C 20 s

  5. repeat steps 2–4 for 20 cycles

  6. 12 °C 1 min

  7. 4 °C hold
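The two reaction mixtures above can be checked for internal consistency. The sketch below sums the listed component volumes and derives final concentrations from the stated stocks; the dictionary keys and the helper function are illustrative and not part of the original protocol.

```python
# Component volumes in microliters, as listed for the 1st-step reaction.
STEP1_UL = {
    "5x Phusion buffer": 10.0,
    "dNTPs (10 mM stock)": 1.0,
    "DMSO": 2.5,
    "Phusion polymerase": 0.5,
    "Forward primer (10 uM stock)": 1.0,
    "Reverse primer (10 uM stock)": 1.0,
    "DNA template solution": 10.0,
    "Nuclease-free water": 24.0,
}

# The 2nd step replaces 10 uL of template with 2 uL of 1st-step product
# and raises the water volume to 32 uL.
STEP2_UL = {**STEP1_UL, "Nuclease-free water": 32.0}
del STEP2_UL["DNA template solution"]
STEP2_UL["PCR product from 1st step"] = 2.0


def summarize(mix):
    """Return total volume and final concentrations for one reaction mix."""
    total = sum(mix.values())
    return {
        "total_ul": total,
        "primer_um": 10.0 * mix["Forward primer (10 uM stock)"] / total,
        "dntp_mm": 10.0 * mix["dNTPs (10 mM stock)"] / total,
        "dmso_percent": 100.0 * mix["DMSO"] / total,
    }


summary = {"step 1": summarize(STEP1_UL), "step 2": summarize(STEP2_UL)}
for step, values in summary.items():
    print(step, values)
```

Both mixes sum to the stated 50 μL, which gives 0.2 μM of each primer, 0.2 mM dNTPs and 5% DMSO in each reaction.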

General data processing methods

Core scripts are available as part of the Supplementary_data.zip and at https://github.com/derdalab/KYL. The data storage cloud http://48hd.cloud/ was implemented in a Linux-Apache-MySQL-Python (LAMP) architecture; details of this implementation are beyond the scope of this report. We used Bioconductor EdgeR DE analysis with modeling of the observed counts using a negative binomial model, Benjamini–Hochberg (BH) adjustment to control the FDR at α = 0.05, and normalization of data across multiple samples using TMM normalization. Prior to DE-analysis, “test” and “control” datasets were retrieved from the https://48hd.cloud/ server as tables of peptides, DNA, and raw sequencing counts. Tables of raw sequencing counts and DA values presented in this manuscript are available in Dataset S1.
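To illustrate the multiple-testing step, the sketch below implements the standard Benjamini–Hochberg adjustment in plain Python. It is not the EdgeR code used in this work (which is written in R and also performs TMM normalization and negative binomial modeling); the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


ALPHA = 0.05
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical per-sequence p-values
adj_p = benjamini_hochberg(raw_p)
significant = [p <= ALPHA for p in adj_p]

for raw, adj, hit in zip(raw_p, adj_p, significant):
    print(f"raw={raw:.3f}  adjusted={adj:.5f}  significant={hit}")
```

In this toy input, four raw p-values fall below 0.05, but only the two smallest remain significant after adjustment, which is the behavior that controls the false discovery rate across many sequences.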

Processing of Illumina data

The Gzip-compressed FASTQ files were downloaded from BaseSpace Sequence Hub. The files were converted into tables of DNA sequences and their counts per experiment. Briefly, FASTQ files were parsed based on unique multiplexing barcodes within the reads, discarding any reads that contained a low-quality score. Mapping the forward (F) and reverse (R) barcoding regions, allowing no more than one base substitution each, and F-R read alignment, allowing no mismatches between F and R reads, yielded DNA sequences located between the priming regions. The files with DNA reads, raw counts, and mapped peptide modifications were uploaded to the http://48hd.cloud/ server. Each experiment has a unique alphanumeric name (e.g., 20230515–1666OOooJUA-KJ) and a unique static URL (for example, https://48hd.cloud/file/14363).
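The barcode-based parsing described above can be sketched as follows. This is a simplified illustration, not the authors' script: it handles a single forward barcode only (no reverse barcode and no F-R alignment), and the barcodes, the quality cutoff, and the toy reads are all hypothetical because these values are not stated in the text.

```python
from collections import Counter, defaultdict

# HYPOTHETICAL barcodes and quality cutoff; actual values are not given in the text.
BARCODES = {"sample1": "ACGT", "sample2": "TGCA"}
MIN_PHRED = 30
PHRED_OFFSET = 33  # standard Sanger / Illumina 1.8+ FASTQ encoding


def parse_fastq(text):
    """Yield (sequence, quality) pairs from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1].strip(), lines[i + 3].strip()


def passes_quality(quality):
    """True if every base call meets the minimum Phred score."""
    return all(ord(ch) - PHRED_OFFSET >= MIN_PHRED for ch in quality)


def assign_sample(read, max_substitutions=1):
    """Return the single sample whose barcode matches the read start, else None."""
    hits = []
    for sample, barcode in BARCODES.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue
        mismatches = sum(a != b for a, b in zip(prefix, barcode))
        if mismatches <= max_substitutions:
            hits.append(sample)
    return hits[0] if len(hits) == 1 else None


def count_inserts(fastq_text):
    """Count the sequence downstream of the barcode for each sample."""
    counts = defaultdict(Counter)
    for read, quality in parse_fastq(fastq_text):
        if not passes_quality(quality):
            continue
        sample = assign_sample(read)
        if sample is not None:
            counts[sample][read[len(BARCODES[sample]):]] += 1
    return counts


TOY_FASTQ = """\
@r1
ACGTAAACCC
+
IIIIIIIIII
@r2
ACGAAAACCC
+
IIIIIIIIII
@r3
TGCAGGGTTT
+
IIII#IIIII
@r4
GGGGAAACCC
+
IIIIIIIIII
@r5
TGCAGGGTTT
+
IIIIIIIIII
"""

counts = count_inserts(TOY_FASTQ)
for sample in sorted(counts):
    print(sample, dict(counts[sample]))
```

In the toy input, read r2 is rescued despite one substitution in its barcode, r3 is dropped for a low-quality base, and r4 is dropped because its barcode matches no sample.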

URL links to sequencing data can be found in Supplementary Tables S1–S6. Sequencing files of the reshaped libraries were further processed in Python to sort mixed libraries, combine repeated peptide sequences, and remove stuffer and other peptide sequences that are not related to the libraries in use. The scripts and the processed sequencing files are available in Dataset S2.
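The post-processing steps named above (sorting mixed libraries, combining repeated peptides, removing unrelated sequences) can be sketched as below. This is an illustration rather than the script in Dataset S2: the regular expressions are inferred from the library names, with X taken as any of the 19 non-Cys amino acids, and the input rows are hypothetical.

```python
import re
from collections import Counter, defaultdict

X = "[ADEFGHIKLMNPQRSTVWY]"  # 19 amino acids, Cys excluded

# ASSUMPTION: patterns inferred from the library names SX3CX9C and SX2CX8CX2.
LIBRARY_PATTERNS = {
    "SX3CX9C": re.compile(f"^S{X}{{3}}C{X}{{9}}C$"),
    "SX2CX8CX2": re.compile(f"^S{X}{{2}}C{X}{{8}}C{X}{{2}}$"),
}


def sort_and_merge(rows):
    """Assign peptides to libraries, sum counts of repeats, drop everything else.

    `rows` holds (peptide, dna, count) tuples; several DNA sequences may
    encode the same peptide, so their counts are combined per peptide.
    """
    merged = defaultdict(Counter)
    discarded_reads = 0
    for peptide, _dna, count in rows:
        for library, pattern in LIBRARY_PATTERNS.items():
            if pattern.match(peptide):
                merged[library][peptide] += count
                break
        else:
            discarded_reads += count  # stuffer or unrelated sequence
    return merged, discarded_reads


# Hypothetical rows; DNA entries are placeholders, not real sequences.
pep_a = "S" + "WDE" + "C" + "GHIKLMNPQ" + "C"
pep_b = "S" + "AD" + "C" + "EFGHIKLM" + "C" + "NP"
rows = [
    (pep_a, "dna_variant_1", 10),
    (pep_a, "dna_variant_2", 5),
    (pep_b, "dna_variant_3", 7),
    ("SGGGS", "dna_variant_4", 100),
]

merged, discarded_reads = sort_and_merge(rows)
for library, peptides in merged.items():
    print(library, dict(peptides))
print("discarded reads:", discarded_reads)
```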

Supplementary Material

Supplement 1
media-1.pdf (2.7MB, pdf)
Supplement 2
media-2.zip (18.5MB, zip)

Significance Statement.

Genetically encoded selection technologies, such as phage, mRNA and ribosome display, have produced FDA-approved therapeutics and numerous clinical candidates. Yet reproducibility in such in vitro discovery systems is rarely evaluated against a defined experimental baseline. Here, we establish a universal baseline by spiking unrelated, DNA-barcoded peptide sequences into selection libraries and quantifying their binding alongside target-enriched populations. This composition-agnostic strategy enables rigorous normalization, confidence assessment, and cross-target comparison of molecular discovery outcomes. Our framework introduces practical standards for reproducibility and statistical benchmarking across genetically encoded display platforms.

Acknowledgments

The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022–04484 to R.D.), Natural Sciences and Engineering Research Council of Canada Accelerator Supporting (to R.D.), Canadian Institutes of Health Research (CIHR FRN: 168961 to R.D.) and Mitacs Elevate Fellowship (to G.M.L.). Infrastructure support was provided by the Canada Foundation for Innovation New Leader Opportunity (to R.D.) and the National Institute of General Medical Sciences of the National Institutes of Health (R01GM145011 and F31GM155953 to F.B.M.).

Footnotes

Competing Interest Statement: R.D. is the C.F.O. and a shareholder of 48Hour Discovery Inc. The other authors declare no competing financial interest.

References

  1. Wang L. et al., Therapeutic peptides: current applications and future directions. Signal transduction and targeted therapy 7, 48 (2022).
  2. Peterson A. A., Liu D. R., Small-molecule discovery through DNA-encoded libraries. Nature Reviews Drug Discovery 22, 699–722 (2023).
  3. Smith G. P., Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science 228, 1315–1317 (1985).
  4. Wilson D. S., Keefe A. D., Szostak J. W., The use of mRNA display to select high-affinity protein-binding peptides. Proceedings of the National Academy of Sciences 98, 3750–3755 (2001).
  5. Hanes J., Plückthun A., In vitro selection and evolution of functional proteins by using ribosome display. Proceedings of the National Academy of Sciences 94, 4937–4942 (1997).
  6. Boder E. T., Wittrup K. D., Yeast surface display for screening combinatorial polypeptide libraries. Nature biotechnology 15, 553–557 (1997).
  7. Tucker A. T. et al., Discovery of next-generation antimicrobials through bacterial self-screening of surface-displayed peptide libraries. Cell 172, 618–628.e613 (2018).
  8. Ellington A. D., Szostak J. W., In vitro selection of RNA molecules that bind specific ligands. Nature 346, 818–822 (1990).
  9. Tse B. N., Snyder T. M., Shen Y., Liu D. R., Translation of DNA into a library of 13 000 synthetic small-molecule macrocycles suitable for in vitro selection. Journal of the American Chemical Society 130, 15611–15626 (2008).
  10. Leek J. T. et al., Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733–739 (2010).
  11. Denton K. E. et al., Robustness of in vitro selection assays of DNA-encoded peptidomimetic ligands to CBX7 and CBX8. SLAS DISCOVERY: Advancing the Science of Drug Discovery 23, 417–428 (2018).
  12. Reddy R. et al., Evaluation of multiplexed liquid glycan Array (LiGA) for serological detection of glycan-binding antibodies. Glycobiology 35, cwaf042 (2025).
  13. Bakhshinejad B., Zade H. M., Shekarabi H. S. Z., Neman S., Phage display biopanning and isolation of target-unrelated peptides: in search of nonspecific binders hidden in a combinatorial library. Amino Acids 48, 2699–2716 (2016).
  14. Matochko W. L., Cory Li S., Tang S. K., Derda R., Prospective identification of parasitic sequences in phage display screens. Nucleic acids research 42, 1784–1798 (2014).
  15. Devlin J. J., Panganiban L. C., Devlin P. E., Random peptide libraries: a source of specific protein binding molecules. Science 249, 404–406 (1990).
  16. Koivunen E., Wang B., Ruoslahti E., Phage libraries displaying cyclic peptides with different ring sizes: ligand specificities of the RGD-directed integrins. Bio/technology 13, 265–270 (1995).
  17. Koivunen E., Gay D. A., Ruoslahti E., Selection of peptides binding to the alpha 5 beta 1 integrin from phage display library. Journal of Biological Chemistry 268, 20205–20210 (1993).
  18. Ruoslahti E., RGD and other recognition sequences for integrins. Annual review of cell and developmental biology 12, 697–715 (1996).
  19. Mayrose I. et al., Epitope mapping using combinatorial phage-display libraries: a graph-based algorithm. Nucleic acids research 35, 69–78 (2007).
  20. Li T. et al., Blocker-SELEX: a structure-guided strategy for developing inhibitory aptamers disrupting undruggable transcription factor interactions. Nature Communications 15, 6751 (2024).
  21. Liu K. et al., Baited SELEX: drug-directed selection of aptamers to PSMA for in vivo targeting of prostate cancer xenografts in mice. Journal of the American Chemical Society 147, 40879–40894 (2025).
  22. Yu Z. et al., Discovery and characterization of bromodomain 2–specific inhibitors of BRDT. Proceedings of the National Academy of Sciences 118, e2021102118 (2021).
  23. Fernandez-Montalvan A. E. et al., Isoform-selective ATAD2 chemical probe with novel chemical structure and unusual mode of action. ACS chemical biology 12, 2730–2736 (2017).
  24. Oehler S. et al., Affinity Selections of DNA‐Encoded Chemical Libraries on Carbonic Anhydrase IX‐Expressing Tumor Cells Reveal a Dependence on Ligand Valence. Chemistry–A European Journal 27, 8985–8993 (2021).
  25. Dawadi S. et al., Discovery of potent thrombin inhibitors from a protease-focused DNA-encoded chemical library. Proceedings of the National Academy of Sciences 117, 16782–16789 (2020).
  26. Bowley D., Labrijn A., Zwick M., Burton D., Antigen selection from an HIV-1 immune antibody library displayed on yeast yields many novel antibodies compared to selection from the same library displayed on phage. Protein Engineering, Design & Selection 20, 81–90 (2007).
  27. Zhao J., Li Y., Terasaka N., Aikawa H., Suga H., Diversity Scale of Library Matters: Impact of mRNA Library Diversity Scales on the Discovery of Macrocyclic Peptides Targeting a Protein by the RaPID System. ACS Central Science 11, 431–440 (2025).
  28. Almagro J. C., Pedraza-Escalona M., Arrieta H. I., Pérez-Tapia S. M., Phage display libraries for antibody therapeutic discovery and development. Antibodies 8, 44 (2019).
  29. Takahashi M. et al., High throughput sequencing analysis of RNA libraries reveals the influences of initial library and PCR methods on SELEX efficiency. Scientific reports 6, 33697 (2016).
  30. Carothers J. M., Oestreich S. C., Davis J. H., Szostak J. W., Informational Complexity and Functional Activity of RNA Structures. Journal of the American Chemical Society 126, 5130–5137 (2004).
  31. Storz J. F., Causes of molecular convergence and parallelism in protein evolution. Nature Reviews Genetics 17, 239–250 (2016).
  32. Popović M. et al., In vitro selections with RNAs of variable length converge on a robust catalytic core. Nucleic acids research 49, 674–683 (2021).
  33. Stern D. L., The genetic causes of convergent evolution. Nature Reviews Genetics 14, 751–764 (2013).
  34. Dickinson B. C., Leconte A. M., Allen B., Esvelt K. M., Liu D. R., Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proceedings of the National Academy of Sciences 110, 9007–9012 (2013).
  35. Lu S. S. et al., Mapping the diverse topologies of protein-protein interaction fitness landscapes. bioRxiv, 2025.2010. 2014.682342 (2025).
  36. Styles M. J. et al., PANCS-Binders: a rapid, high-throughput binder discovery platform. Nature Methods 22, 1720–1730 (2025).
  37. Malo N., Hanley J. A., Cerquozzi S., Pelletier J., Nadon R., Statistical practice in high-throughput screening data analysis. Nature biotechnology 24, 167–175 (2006).
  38. Zhang J.-H., Chung T. D., Oldenburg K. R., A simple statistical parameter for use in evaluation and validation of high throughput screening assays. Journal of biomolecular screening 4, 67–73 (1999).
  39. Eyke N. S., Koscher B. A., Jensen K. F., Toward machine learning-enhanced high-throughput experimentation. Trends in Chemistry 3, 120–132 (2021).
  40. Catacutan D. B., Alexander J., Arnold A., Stokes J. M., Machine learning in preclinical drug discovery. Nature Chemical Biology 20, 960–973 (2024).
  41. Tsang John S. et al., Global Analyses of Human Immune Variation Reveal Baseline Predictors of Postvaccination Responses. Cell 157, 499–513 (2014).
  42. Lancet D., Sadovsky E., Seidemann E., Probability model for molecular recognition in biological receptor repertoires: significance to the olfactory system. Proceedings of the National Academy of Sciences 90, 3715–3719 (1993).
  43. Tjhung K. F. et al., Silent Encoding of Chemical Post-Translational Modifications in Phage-Displayed Libraries. Journal of the American Chemical Society 138, 32–35 (2016).
  44. Ekanayake A. I. et al., Genetically Encoded Fragment-Based Discovery from Phage-Displayed Macrocyclic Libraries with Genetically Encoded Unnatural Pharmacophores. Journal of the American Chemical Society 143, 5497–5507 (2021).
  45. Yan K. et al., Late-Stage Reshaping of Phage-Displayed Libraries to Macrocyclic and Bicyclic Landscapes using a Multipurpose Linchpin. Journal of the American Chemical Society 147, 789–800 (2025).
  46. Alteen M. G. et al., Phage display uncovers a sequence motif that drives polypeptide binding to a conserved regulatory exosite of O-GlcNAc transferase. Proceedings of the National Academy of Sciences 120, e2303690120 (2023).
  47. Quartararo A. J. et al., Ultra-large chemical libraries for the discovery of high-affinity peptide binders. Nature communications 11, 3183 (2020).
  48. Kale S. S. et al., Cyclization of peptides with two chemical bridges affords large scaffold diversities. Nature Chemistry 10, 715–723 (2018).
  49. Zhang S. et al., Radiopharmaceuticals and their applications in medicine. Signal transduction and targeted therapy 10, 1 (2025).
  50. Kaew-Amdee S., Makornwattana M., Charlermroj R., Identification of novel human IgE-binding peptides from a phage display library for total IgE detection. Scientific Reports 15, 27986 (2025).
  51. Yang L., Zhang J., Andon J. S., Li L., Wang T., Rapid discovery of cyclic peptide protein aggregation inhibitors by continuous selection. Nature chemical biology 21, 588–597 (2025).
  52. Rader C., Cheresh D. A., Barbas III C. F., A phage display approach for rapid antibody humanization: designed combinatorial V gene libraries. Proceedings of the National Academy of Sciences 95, 8910–8915 (1998).
  53. Kraft S. et al., Definition of an unexpected ligand recognition motif for αvβ6 integrin. Journal of Biological Chemistry 274, 1979–1985 (1999).
  54. Singh A. N. et al., Dimerization of a phage-display selected peptide for imaging of αvβ6-integrin: two approaches to the multivalent effect. Theranostics 4, 745 (2014).
  55. Li R., Hoess R. H., Bennett J. S., DeGrado W. F., Use of phage display to probe the evolution of binding specificity and affinity in integrins. Protein engineering 16, 65–72 (2003).
  56. Newberry R. W., Raines R. T., “4-Fluoroprolines: Conformational analysis and effects on the stability and folding of peptides and proteins” in Peptidomimetics I. (Springer, 2016), pp. 1–25.
  57. Diderich P. et al., Phage selection of chemically stabilized α-helical peptide ligands. ACS chemical biology 11, 1422–1427 (2016).
  58. Brown J. S. et al., Regularized indirect learning improves phage display ligand discovery. (2023).
  59. Teyra J., Sidhu S. S., Kim P. M., Elucidation of the binding preferences of peptide recognition modules: SH3 and PDZ domains. FEBS letters 586, 2631–2637 (2012).
  60. Juds C. et al., Combining phage display and next-generation sequencing for materials sciences: a case study on probing polypropylene surfaces. Journal of the American Chemical Society 142, 10624–10628 (2020).
  61. He B. et al., Compositional Bias in Naïve and Chemically-modified Phage-Displayed Libraries uncovered by Paired-end Deep Sequencing. Scientific Reports 8, 1214 (2018).
  62. Sojitra M. et al., Measuring carbohydrate recognition profile of lectins on live cells using liquid glycan array (LiGA). Nature Protocols 20, 989–1019 (2025).
  63. Lin C.-L. et al., Chemoenzymatic synthesis of genetically-encoded multivalent liquid N-glycan arrays. Nature Communications 14, 5237 (2023).
  64. Sojitra M. et al., Genetically encoded multivalent liquid glycan array displayed on M13 bacteriophage. Nature chemical biology 17, 806–816 (2021).
  65. Lima G. M. et al., The liquid lectin array detects compositional glycocalyx differences using multivalent DNA-encoded lectins on phage. Chemistry & Biology 31, 1986–2001.e1989 (2024).
  66. Wong J. Y. K. et al., Genetically encoded discovery of perfluoroaryl macrocycles that bind to albumin and exhibit extended circulation in vivo. Nat. Commun. 14, 5654 (2023).
  67. Sojitra M. et al., Genetically encoded multivalent liquid glycan array displayed on M13 bacteriophage. Nat. Chem. Biol. 17, 806–816 (2021).
  68. Matochko W. L. et al., Deep sequencing analysis of phage libraries using Illumina platform. Methods 58, 47–55 (2012).
  69. Reddy R. (2021) Multiplexed Liquid Glycan Array (LiGA) for serological assays. in Chemistry (University of Alberta, Education and Research Archive).
  70. Lin C.-L. et al., Chemoenzymatic synthesis of genetically-encoded multivalent liquid N-glycan arrays. Nat. Commun. 14, 5237 (2023).

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
