Randomized gates eliminate bias in sort‐seq assays

Brian L Trippe; Buwei Huang; Erika A DeBenedictis; Brian Coventry; Nicholas Bhattacharya; Kevin K Yang; David Baker; Lorin Crawford

doi:10.1002/pro.4401

Protein Science. 2022 Aug 30;31(9):e4401. doi: 10.1002/pro.4401

PMCID: PMC9601873

Abstract

Sort‐seq assays are a staple of the biological engineering toolkit, allowing researchers to profile many groups of cells based on any characteristic that can be tied to fluorescence. However, current approaches, which segregate cells into bins deterministically based on their measured fluorescence, introduce systematic bias. We describe a surprising result: one can obtain unbiased estimates by incorporating randomness into sorting. We validate this approach in simulation and experimentally, and describe extensions for both estimating group level variances and for using multi‐bin sorters.

Keywords: fluorescence activated cell sorting, high‐throughput screening, multiplexed measurements, sort‐seq assays, statistical methods

1. INTRODUCTION

Quantitative, multiplexed assays relying on fluorescence activated cell sorting (FACS) followed by high‐throughput sequencing are critical to modern biology and molecular engineering because they enable construction of large scale datasets connecting sequence to function. For example, these “sort‐seq” assays are widely used to profile the strength of protein–protein binding interactions via yeast display. ¹ , ² , ³ , ⁴ In particular, one (i) synthesizes a library of 10⁴ to 10⁵ DNA sequences encoding proteins that may bind to a target of interest; (ii) transforms the library into yeast such that each putative binder is expressed on the surface of a population of cells; (iii) incubates cells with fluorescently labeled target protein; (iv) physically separates 10⁶ to 10⁸ cells based on binding affinity by FACS; and finally, (v) quantifies the prevalence, and thereby binding affinity, of each library member by high‐throughput sequencing. Due to biological and technical variability, there is a distribution over (log) fluorescence for each library sequence, and the challenge is to estimate the means of each of these distributions (Figure 1a,b). For example, for binding interactions, this mean fluorescence relates directly to biophysical quantities of interest including dissociation constants and binding energies. ⁴ , ⁵ , ⁶

Figure 1. Schematic overview of randomized gates. (a) Distributions of log fluorescence for different cell populations and (b) their hypothetical true and estimated means. (c) An example of the histogram approach with deterministic collection into four bins and (d) an example of the randomized collection approach with two bins. (e) Estimated means of the randomized gating scheme are more accurate than the histogram approach as the number of collected cells increases.

In previous work, cells are deterministically segregated into one or more collection tubes (referred to as “bins”) based on their measured fluorescences, and the mean fluorescence of each population is estimated from the histogram of observed sequence counts in each bin (Figure 1c). Peterman and Levine ⁷ compare the error associated with different strategies for collecting and analyzing such data, and they show that average squared error is the sum of contributions from bias and variance (e.g., Hastie et al., ⁸ Chapter 7.3). The variance arises from experimental noise and variability across cells, and it can be reduced by increasing the number of cells screened. The bias arises from the discretization of the space of log fluorescence into bins (Figure 1b,c); for example, two narrow distributions can be sorted entirely into the same bin yet have means that differ by nearly the bin width. Moreover, this bias poses reproducibility challenges; the direction and magnitude of bias depend on how the bins are chosen, but this choice is subjective and commonly depends on variable experimental conditions. Because even the most sophisticated FACS machines can sort cells into at most six bins, resolution is limited. This low resolution limits the value of sort‐seq data in quantitative analyses, for instance, by prohibiting computation of precise binding energies. This challenge has spurred much work on how to effectively reduce histogram bias. ⁵ , ⁶ , ⁷ , ⁹ One common approach seeks to overcome the resolution limits of histograms by assuming fluorescence is log‐normally distributed for each population and using maximum likelihood estimation to estimate moments. ² , ⁹ , ¹⁰ , ¹¹ However, on real data, this assumption is violated and the resulting estimates can have greater bias than the naive approach (Figure S1).

2. RESULTS AND DISCUSSION

In this work, we show that the bias generated using histograms can be eliminated altogether by incorporating randomness into FACS collection strategies with as few as two bins (Figure 1d), thereby obtaining arbitrarily accurate estimates with many cells (Figure 1e). To do this, we take a statistical approach. We consider a population of cells that pass through a 2‐bin sorter, each with log fluorescence F independently and identically distributed according to a density function $p_{F}$. Our target of interest is the mean log fluorescence, $μ_{F} = \int f p_{F} (f) d f .$ Let B denote the bin (either 1 or 2) into which a cell is collected, and let $Y_{1}$ and $Y_{2}$ be the counts of cells in Bins 1 and 2 after sorting, respectively. In multiplexed sort‐seq assays, we obtain $Y_{1}$ and $Y_{2}$ for thousands of populations, and our goal is to accurately estimate the mean of each population simultaneously.

For standard binning, a gate is chosen for each bin that defines the range of values F for which cells are collected into that bin; so, the bin B is deterministic once F is measured (e.g., as in Figure 1c). We instead consider randomized gates which define for each bin the probability of collecting a cell at each fluorescence (as in Figure 1d) and rely on pseudo‐random numbers to determine the bin. For estimating population means, when the fluorescence measurements fall between lower and upper bounds L and U, one first sorts using randomized gates such that for any f on the interval [L,U],

$ℙ (B = 1 | F = f) = 1 - \frac{f - L}{U - L} \quad \text{and} \quad ℙ (B = 2 | F = f) = \frac{f - L}{U - L} . \qquad (1)$

The counts are then combined into an empirical estimate of $μ_{F}$ as $\hat{μ} = (U - L) \cdot Y_{2} / (Y_{1} + Y_{2}) + L$ .
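The sorting rule in Equation (1) and the estimate $\hat{μ}$ can be illustrated with a minimal Python sketch. This is not the code used in the study; the function names (`prob_bin2`, `sort_randomized`, `mu_hat`), the bounds L = 0 and U = 1, and the uniform test population are assumptions made purely for illustration.

```python
import random


def prob_bin2(f, lower, upper):
    """Probability of collecting a cell with log fluorescence f into Bin 2 (Equation (1))."""
    return (f - lower) / (upper - lower)


def sort_randomized(fluorescences, lower, upper, rng):
    """Sort cells into two bins with randomized gates; return the counts (y1, y2)."""
    y1 = y2 = 0
    for f in fluorescences:
        # A pseudo-random draw decides the bin, as described in the text.
        if rng.random() < prob_bin2(f, lower, upper):
            y2 += 1
        else:
            y1 += 1
    return y1, y2


def mu_hat(y1, y2, lower, upper):
    """Estimate of the mean log fluorescence from the two bin counts."""
    return (upper - lower) * y2 / (y1 + y2) + lower


rng = random.Random(0)  # fixed seed so the sketch is reproducible
L_BOUND, U_BOUND = 0.0, 1.0  # assumed bounds L and U
# Hypothetical population: log fluorescence uniform on [0.2, 0.4], true mean 0.3.
cells = [rng.uniform(0.2, 0.4) for _ in range(100_000)]
y1, y2 = sort_randomized(cells, L_BOUND, U_BOUND, rng)
estimate = mu_hat(y1, y2, L_BOUND, U_BOUND)
print(f"Y1={y1}, Y2={y2}, estimate={estimate:.4f} (true mean 0.3)")
```

Only the two counts enter the estimate, which is what makes the scheme compatible with multiplexed sequencing readouts: the individual fluorescence values are never needed after sorting.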

While one might expect introducing randomness to decrease precision by introducing additional noise, $\hat{μ}$ is directly informative about the mean fluorescence. In particular, $\hat{μ}$ is an unbiased estimate of the true population mean in the sense that the average value we would expect for $\hat{μ}$ if we repeated the sort‐seq experiment many times is equal to $μ_{F}$ (see Theorem 1 in Section 4).

This unbiasedness theorem guarantees that, in contrast to the histogram approach, we can get arbitrarily accurate estimates by screening larger numbers of cells (Figures 1e and S2). More precisely, recalling that the mean squared error (MSE) is the sum of the bias squared and the variance, ⁸ unbiasedness implies that the error of $\hat{μ}$ is dictated solely by its variance. Moreover, $\hat{μ}$ allows a transparent trade‐off between the number of cells sorted per population and the precision of the estimates; notably, with as few as 400 cells, a 95% confidence interval for $μ_{F}$ will cover at most 10% of the range from L to U (Section 4).

2.1. Randomized gates provide superior accuracy to histograms in simulation

We used a simulation study to explore the implications of unbiasedness on estimation accuracy with the randomized gate approach relative to the standard histogram approach. In this study, we simulated fluorescence of 250 cells from log‐normal distributions with different means and variances (Figure 2a). We then simulated sorting these cells based on their fluorescence either with four deterministic gates of equal width or with two randomized gates as dictated by Equation (1). For the deterministic gates, we constructed histograms and computed estimates of the mean fluorescence as the average of the bin centers weighted by the fraction of cells they contained; and for the randomized gates, we estimated the mean as $\hat{μ}$ . Figure 2b,c report the performance of these estimates in terms of MSE, along with their bias and variance components. As expected, the randomized gates approach has negligible bias except for broad distributions violating the conditions of our theorem (Section 4).
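A scaled‐down version of this comparison can be sketched in Python as follows. It is an illustrative approximation rather than the simulation code behind Figure 2: the single population (log fluorescence Gaussian with mean 0.30 and standard deviation 0.02), the range [0, 1], the four equal‐width bins, and all function names are assumptions chosen for the example. The histogram estimate follows the weighted bin‐center formula given in Section 4.3.

```python
import random
import statistics


def randomized_estimate(cells, lower, upper, rng):
    """Two randomized gates (Equation (1)) followed by the estimate mu-hat."""
    y2 = sum(1 for f in cells if rng.random() < (f - lower) / (upper - lower))
    return (upper - lower) * y2 / len(cells) + lower


def histogram_estimate(cells, edges):
    """Average of bin centers weighted by the fraction of collected cells per bin."""
    total, collected = 0.0, 0
    for f in cells:
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= f < hi:
                total += (lo + hi) / 2
                collected += 1
                break
    return total / collected  # assumes at least one cell falls inside the bins


def bias_variance_mse(estimates, truth):
    """Decompose the error over replicates: MSE = bias^2 + variance."""
    bias = statistics.fmean(estimates) - truth
    variance = statistics.pvariance(estimates)
    return bias, variance, bias**2 + variance


rng = random.Random(1)
TRUE_MEAN, SD, N_CELLS, N_REPS = 0.30, 0.02, 250, 200  # assumed settings
EDGES = [0.0, 0.25, 0.5, 0.75, 1.0]  # four deterministic gates of equal width

rand_estimates, hist_estimates = [], []
for _ in range(N_REPS):
    cells = [rng.gauss(TRUE_MEAN, SD) for _ in range(N_CELLS)]
    rand_estimates.append(randomized_estimate(cells, 0.0, 1.0, rng))
    hist_estimates.append(histogram_estimate(cells, EDGES))

rand_bias, rand_var, rand_mse = bias_variance_mse(rand_estimates, TRUE_MEAN)
hist_bias, hist_var, hist_mse = bias_variance_mse(hist_estimates, TRUE_MEAN)
print(f"randomized: bias={rand_bias:+.4f} var={rand_var:.5f} mse={rand_mse:.5f}")
print(f"histogram:  bias={hist_bias:+.4f} var={hist_var:.5f} mse={hist_mse:.5f}")
```

In this configuration nearly every cell lands in the second histogram bin, so the histogram estimate sits close to that bin's center (0.375) regardless of the true mean, while the randomized estimate trades that bias for sampling variance.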

Figure 2. Simulation study reveals improved estimation properties obtained with randomized gates as compared to histograms. (a) Fluorescence values of cells are drawn independently from log‐normal distributions with different scales and with varied means, where the black arrows represent simulated changes of the means. (b) The relative performances of estimates from histograms and randomized gates across a range of mean log fluorescences in terms of mean squared error (ratios greater than 1 reflect lower error with randomized gates and ratios below 1 reflect lower error with histograms). (c) The mean squared error (left) decomposed into bias (center) and variance (right) for both estimates. All points are the average across 200 replicates, each with N = 250 cells.

With even as few as 250 cells per population, the MSE of the histogram approach is dominated by bias. Accordingly, the unbiased randomized approach typically provides more accurate estimates. Notably, 250 cells is fewer than is typical in sort‐seq assays; with larger samples, more pronounced improvements are obtained (Figure S2). Because the histogram estimates are systematically biased toward bin centers, they can, however, be more accurate for narrow distributions with means near bin centers (Figure 2b).

2.2. Experimental implementation via shifting gate thresholds

We next tested our approach experimentally. Current FACS software does not support randomized gate programming, so we devised an experimental approximation in which we manually changed the gating threshold 20 times during sorting at regular intervals (Section 4). We tested this procedure in the context of a binding assay using yeast display. ¹² We synthesized DNA encoding four mini‐protein binders to the SARS‐COV‐2 receptor binding domain (RBD) with a range of binding affinities. ¹ While the value of this approach is greatest for highly multiplexed assays with many thousands of sequences, we chose this small number so that we could also test each binder easily in serial. We separately transformed and expressed each design in yeast and then incubated the populations with RBD. Both the target and binders were fluorescently labeled, and we considered the log ratio of target to binder fluorescence as an expression normalized proxy for binder strength. ⁶ We measured each sample on a Sony SH800 cell sorter separately, recording the binding signal for each binder (Figure 3a). We then pooled the samples together and sorted 1,000,000 cells, collecting 50,000 cells at each of the 20 thresholds (Section 4).

Figure 3. Agreement of binding signal of de novo designed binding proteins measured via yeast display in multiplex with ground truth values obtained in clonal yeast. (a) Distributions of samples measured clonally by flow cytometry, and distributions of pooled samples during sorting with a shifting gate boundary. Black triangles represent 6 of the 20 stopping points for the shifted gate. (b) Agreement of clonal and multiplexed binding signal. The x‐axis is measured by flow cytometry while the y‐axis is a multiplexed measurement by next‐generation sequencing. Error bars represent the size of the steps used when shifting the threshold.

The multiplexed measurements largely recapitulate the ground‐truth clonal measurements (Figure 3b), with the exception of design candidate 2018, for which the multiplexed estimate is below the clonal one. We suspect this is due to dissociation of some of the target protein in the time between the clonal and multiplexed measurements; kinetics experiments suggest dissociation occurs rapidly for this design. ¹

3. CONCLUSION

In the supplementary note, we additionally describe two extensions of this idea. First, because differences in the variability of fluorescence across populations are often of interest (in addition to mean fluorescence), we show how to extend the approach to estimate the variance for each population and validate this approach in simulation (Figure S4). Second, we describe how to effectively take advantage of sorters that sort into more than two bins simultaneously to obtain more accurate estimates. We view these contributions as a starting point for future work on using randomness to obtain precise, multiplexed estimates.

We have shown how to obtain precise, multiplexed estimates in sort‐seq experiments with a simple strategy that incorporates randomness. With as few as two randomized gates, this mathematical technique allows one to collect more accurate data than one could previously obtain with four or six bin sorters. Moreover, this greater accuracy is attained with less sophisticated hardware and less downstream experimental effort. While we have emphasized studies of binding affinity, we believe our strategy is applicable to a wider range of applications of sort‐seq assays including studying transcriptional regulation ¹⁰ , ¹³ and protein stability, ¹¹ and building datasets for protein design. ¹⁴ Widespread implementation of randomized gates in FACS and community adoption of this strategy will greatly simplify and improve sort‐seq assays by eliminating a common bias in this ubiquitous assay. We believe this will allow FACS to play a more central role in screening settings, for construction of reliable datasets for machine learning models in bio‐design applications, and for building datasets for quantitative models in biology more generally.

4. MATERIALS AND METHODS

4.1. Unbiasedness of estimates from randomized gates

The advantage of the randomized gates presented in Equation (1) is that the resulting counts in each bin (Y ₁ and Y ₂) may be combined as $\hat{μ} = (U - L) \cdot Y_{2} / (Y_{1} + Y_{2}) + L$ to estimate $μ_{F}$ without bias. We make this statement precise and present a theorem that guarantees when this is the case.

For an estimator $\hat{θ}$ of a fixed estimand θ, the estimator's bias is the expected value of its error $E [\hat{θ} - θ | θ]$ conditioned on that particular value of θ. An estimator is called unbiased if, regardless of the value of the estimand, the bias is equal to zero—that is, if $E [\hat{θ} | θ] = θ$ for every θ. Theorem 1 states that this property holds for $\hat{μ} .$

Theorem 1

(Unbiasedness with randomized gates). If the support of p_F is bounded between L and U, then $\hat{μ}$ is an unbiased estimator of mean fluorescence. That is, $E [\hat{μ}] = μ_{F}$ .

We begin by rewriting the probability that a cell is collected into Bin 2 to expose the connection between this quantity and $μ_{F}$ :

$\begin{array}{l} ℙ (B = 2) = \int_{L}^{U} p_{F} (f) ℙ (B = 2 | F = f) df \quad \text{// by the law of total probability and the support assumption} \\ = \int_{L}^{U} p_{F} (f) \frac{f - L}{U - L} df \quad \text{// by Equation (1)} \\ = (μ_{F} - L) / (U - L) . \end{array}$

If N = Y ₁ + Y ₂ total cells are collected, then the count in the second bin is distributed as $Y_{2} ∣ N ~ Binomial ((μ_{F} - L) / (U - L), N)$ and has mean $E [Y_{2}] = N \cdot (μ_{F} - L) / (U - L)$ . Accordingly, for any N total number of cells, $E [\hat{μ}| Y_{1} + Y_{2} = N] = N \cdot (μ_{F} - L) / (Y_{1} + Y_{2}) + L = μ_{F}$ . When N is random as well, then by the law of iterated expectation, $E [\hat{μ}] = E [E [\hat{μ}| Y_{1} + Y_{2} = N]] = μ_{F}$ as desired.
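For readers who prefer the conditional expectation step written out in full, it can be expanded as follows. This is only a restatement of the argument above using the same symbols, with the total count N held fixed and $E [Y_{2} | N] = N (μ_{F} - L) / (U - L)$ taken from the Binomial mean.

```latex
\begin{aligned}
E[\hat{\mu} \mid Y_1 + Y_2 = N]
  &= (U - L)\,\frac{E[Y_2 \mid N]}{N} + L \\
  &= (U - L)\,\frac{1}{N}\cdot N\,\frac{\mu_F - L}{U - L} + L \\
  &= (\mu_F - L) + L \\
  &= \mu_F .
\end{aligned}
```

Because the result does not depend on N, averaging over a random N leaves it unchanged, which is the law of iterated expectation step.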

Notably, this theorem holds for any distribution p _F satisfying the support condition and does not require any parametric assumptions such as log‐normality.

4.2. Trade‐off between number of cells sorted and precision of estimates

The relative simplicity of the estimate $\hat{μ}$ leads to a transparent trade‐off between the precision and scale of the experiment. Recalling that $Y_{2} ∣ N ~ Binomial ((μ_{F} - L) / (U - L), N),$ the variance of $\hat{μ}$ is

$Var [\hat{μ} | N] = \frac{{(U - L)}^{2}}{N^{2}} Var [Y_{2} | N] = \frac{{(U - L)}^{2}}{N} ℙ (B = 1) ℙ (B = 2) .$

To construct a confidence interval for $μ_{F},$ we can therefore first approximate the standard error of $\hat{μ}$ by $\frac{U - L}{\sqrt{N}} \frac{\sqrt{Y_{1} Y_{2}}}{N},$ and appeal to approximate normality of the Binomial distribution for moderate to large N to report $μ_{F} = \hat{μ} \pm 2 \frac{U - L}{\sqrt{N}} \frac{\sqrt{Y_{1} Y_{2}}}{N}$ with 95% confidence. Because $\sqrt{Y_{1} Y_{2}} / N$ can be at most 1/2 (if Y ₁ = Y ₂), the size of this interval is at most $2 (U - L) / \sqrt{N} .$ Therefore, to estimate $μ_{F}$ to within one tenth of the range with high confidence, at most N = 400 cells are needed, since in this case $2 (U - L) / \sqrt{N} = (U - L) / 10$ .
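The interval above is simple enough to compute directly from the two counts. The following Python sketch is illustrative only; the function name `confidence_interval` and the multiplier argument `z` (set to 2, as in the text) are assumptions, and the normal approximation is only reasonable for moderate to large N with both counts nonzero.

```python
import math


def confidence_interval(y1, y2, lower, upper, z=2.0):
    """Approximate 95% interval for the mean log fluorescence from bin counts."""
    n = y1 + y2
    estimate = (upper - lower) * y2 / n + lower
    # Plug-in standard error: (U - L) / sqrt(N) * sqrt(Y1 * Y2) / N.
    std_err = (upper - lower) / math.sqrt(n) * math.sqrt(y1 * y2) / n
    return estimate - z * std_err, estimate + z * std_err


# Worst case for the width: equal counts, here with N = 400 and range [0, 1].
lo, hi = confidence_interval(200, 200, 0.0, 1.0)
width = hi - lo
print(f"interval = ({lo:.3f}, {hi:.3f}), width = {width:.3f}")

# Unequal counts give a narrower interval for the same N.
lo_u, hi_u = confidence_interval(100, 300, 0.0, 1.0)
width_unequal = hi_u - lo_u
```

With equal counts and N = 400 the interval spans one tenth of the range, matching the bound stated above; any other split of the 400 cells yields a narrower interval.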

For scale, commercial machines sort on the order of 10,000 cells per second, and typical assays sort tens of millions of cells divided amongst many populations. Thus, a library of 100,000 populations could be screened to high precision with on the order of 1 hr of sorting time.
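The arithmetic behind this estimate can be checked in a few lines of Python. The calculation assumes, as a simplification not stated in the text, that cells are divided evenly among populations and that every sorted cell falls within the gated range; the variable names are illustrative.

```python
populations = 100_000          # library size quoted in the text
cells_per_population = 400     # worst-case N from the confidence interval bound
cells_per_second = 10_000      # order-of-magnitude sorter throughput from the text

total_cells = populations * cells_per_population
sorting_seconds = total_cells / cells_per_second
sorting_hours = sorting_seconds / 3600

print(f"{total_cells:,} cells, about {sorting_hours:.2f} hr of sorting")
```

Under these assumptions the total is 40 million cells and a little over one hour of sorting, consistent with the figures quoted above.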

4.3. Simulation details

In the simulations depicted in Figure 2, we compare against the standard approach of using a histogram to estimate $μ_{F} .$ Consider a K bin histogram. For each bin k, if the range of fluorescences collected is from lower bound $l_{k}$ to upper bound $u_{k}$, then $ℙ (B = k| F = f) = 1 [l_{k} \leq f < u_{k}] .$ The histogram estimate then corresponds to combining the resulting counts as

$\hat{μ}_{Hist} = \sum_{k = 1}^{K} \frac{Y_{k}}{N} \left( \frac{u_{k} + l_{k}}{2} \right) .$

In order to use the unbiased estimator, both in simulation and in practice, we must slightly extend the randomized gate definition proposed in Equation (1). In particular, Theorem 1 assumes that the support of the fluorescence density p _F is bounded between L and U (i.e., that for $F ~ p_{F}, ℙ [L \leq F \leq U] = 1$ ). In practice, this may not be the case. But, as previously stated, Equation (1) returns negative “probabilities” outside of this range. Therefore, we propose to “clip” the collection probabilities at the boundaries, and instead define

$\begin{array}{l} ℙ (B = 1 | F = f) = {\left( 1 - \frac{f - L}{U - L} \right)}_{†} \quad \text{and} \\ ℙ (B = 2 | F = f) = {\left( \frac{f - L}{U - L} \right)}_{†} , \end{array}$

where † denotes clipping between zero and one such that, for a scalar $x, {(x)}_{†} = \max (\min (x, 1), 0) .$ This ensures that $\hat{μ}$ is well‐defined, but gives up unbiasedness in situations where the support assumption of Theorem 1 is violated. This bias is apparent, for example, at the right and left sides of the left panel of Figure 2c.
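The clipped gates can be written as a short Python sketch. The helper names `clip01` and `clipped_gate_probs` are hypothetical, and the bounds in the example calls (L = 0, U = 1) are assumed for illustration.

```python
def clip01(x):
    """Clip a scalar to the interval [0, 1]: max(min(x, 1), 0)."""
    return max(min(x, 1.0), 0.0)


def clipped_gate_probs(f, lower, upper):
    """Collection probabilities (Bin 1, Bin 2) with clipping at the boundaries."""
    scaled = (f - lower) / (upper - lower)
    return clip01(1.0 - scaled), clip01(scaled)


print(clipped_gate_probs(0.25, 0.0, 1.0))   # inside [L, U]: same as Equation (1)
print(clipped_gate_probs(-0.5, 0.0, 1.0))   # below L: always collected into Bin 1
print(clipped_gate_probs(1.5, 0.0, 1.0))    # above U: always collected into Bin 2
```

Inside [L, U] the clipping has no effect, so the two probabilities still sum to one and agree with Equation (1); outside the range every cell is sent to the nearer bin, which is the source of the bias mentioned above.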

4.4. Experimental approximation of randomized gates with shifting thresholds

Because current FACS software does not support randomized gate programming, we devised an experimental approximation in which we manually changed the gating threshold 20 times during sorting at regular intervals. Specifically, we use a gate that collects all cells with fluorescence above a threshold into Bin 2 and those below the threshold into Bin 1, and we shift that threshold over the course of the collection from the lower limit L to the upper limit U. In theory, this approach exactly recovers Equation (1) in the limit that the threshold is shifted continuously from L to U at a constant rate. This is because for a cell with fluorescence f between L and U, the probability that it is collected into Bin 2 is the fraction of the experimental time during which the threshold is below f, which is (f − L)/(U − L). This approximation does not, however, account for possible changes in the distribution $p_{F}$ over time. Such changes occur in binding assays, for example, when a nontrivial amount of labeled target protein dissociates over time. This challenge is a disadvantage of the approximation relative to randomized gates that could in theory be implemented in sorters.
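How closely a finite number of threshold positions approximates Equation (1) can be explored with a small Python sketch. The exact placement of the 20 thresholds is not specified above, so the sketch assumes they sit at the midpoints of equal steps between L and U and that equal numbers of cells are collected at each position; the function names are illustrative.

```python
def prob_bin2_shifting(f, lower, upper, n_steps=20):
    """Fraction of threshold positions lying below f (assumed midpoint placement)."""
    step = (upper - lower) / n_steps
    thresholds = [lower + (i + 0.5) * step for i in range(n_steps)]
    below = sum(1 for t in thresholds if t < f)
    return below / n_steps


def prob_bin2_ideal(f, lower, upper):
    """Bin 2 probability under the randomized gate of Equation (1)."""
    return (f - lower) / (upper - lower)


grid = [j / 100 for j in range(101)]  # fluorescence values spanning [L, U] = [0, 1]
max_error_20 = max(
    abs(prob_bin2_shifting(f, 0.0, 1.0, 20) - prob_bin2_ideal(f, 0.0, 1.0)) for f in grid
)
max_error_200 = max(
    abs(prob_bin2_shifting(f, 0.0, 1.0, 200) - prob_bin2_ideal(f, 0.0, 1.0)) for f in grid
)
print(f"max deviation with 20 steps:  {max_error_20:.4f}")
print(f"max deviation with 200 steps: {max_error_200:.4f}")
```

Under the midpoint assumption the stepwise collection probability never deviates from the linear ramp by more than half a step, that is 1/(2 × 20) of the range for 20 positions, and the deviation shrinks as the number of positions grows, in line with the continuous‐shift limit described above.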

4.5. Yeast display and deep sequencing

EBY100 yeast cells expressing each of the four mini‐protein binders were grown in C‐Trp‐Ura media. Binder protein expression was induced by replacing the growing buffer with SGCAA and incubating at 30°C for 24 hr. ¹⁵ The induced cells were labelled with 250 nM biotinylated receptor binding domain target protein, washed twice with PBSF (PBS + 1% BSA), then labelled again with anti‐c‐Myc fluorescein isothiocyanate (FITC) and streptavidin‐phycoerythrin (SAPE). The experiments were performed on a Sony SH800 cell sorter. Sixty thousand cells were recorded for each binder to reflect the individual distribution of baseline PE signal intensity. In the shifting gate experiment, a square area (AreaTotal) with side length (L) was pre‐determined at the SH800 collection panel. The area was divided into two separate collection gates, Gate1 and Gate2 (corresponding to Bin 1 and Bin 2 in Equation (1)). Gate2 was in the shape of an isosceles right triangle and started with a small area in the bottom‐right corner of AreaTotal, and Gate1 took up the remaining area. The yeast cells were run through the SH800 and each cell went into either the Gate2 or Gate1 collection tube if its log PE/FITC signal was in the range of AreaTotal. All other cells were discarded. After collecting 50,000 cells, the cell flow was paused, Gate2 was shifted both leftwards and upwards by L/10, and cell flow continued. Because the proprietary software for operating the sorter allowed setting gate positions only through a point and click graphical user interface (rather than numerically), we measured out gate increments by pixel distance on the display using a ruler. The above shifting process was repeated 19 times for a total of 20 collections. The cells collected in Gate1 and Gate2 were then grown, and 1 × 10⁷ cells from each gate were barcoded and the sequences for each cell were determined by Illumina next‐generation sequencing. ¹¹ The number of cells collected by each gate for each population was estimated from the proportion of sequencing reads attributed to each population and the number of cells collected into the gates.

Because the number of cells collected by each gate was not made directly available through the proprietary software, we estimated this from the raw exported data. In particular, we imported the data using the FlowCal python package ¹⁶ and computationally implemented the gates and filters (including for forward and backward scatter).

4.6. Sensitivity of maximum likelihood inference to non‐normality of real data

Likelihood‐based inference is a common strategy used with the intent to circumvent the resolution limitation of the histogram approach. ² , ⁹ , ¹⁰ , ¹¹ However, this approach can fail on real data. In particular, existing likelihood methods rely on the assumption that for each of the cell populations the fluorescence values are log normal distributed, $\log F ~ N (μ, σ^{2})$ where the mean log fluorescence $μ = μ_{F}$ is the target of inference and σ ² is the typically unknown variance of the population.

We evaluate performance of maximum likelihood inference in this situation with simulations using data sub‐sampled from a flow cytometry dataset of binding signal of a computationally designed mini‐protein binder to ActRII. Data were collected using yeast display as previously described, except with the addition of a supplemental binding protein, protein A. The binding signal $\log (FITC / PE)$ was recorded for approximately 1,000,000 cells. The distribution of this signal is highly non‐Gaussian (Figure S1A).

We first compared the performance of the maximum likelihood approach (described in greater detail below) to the randomized approach on downsampled datasets with N = 250 cells with the same set‐up described in Figure 2. As in the earlier simulations, the randomized approach provides improved MSE across most simulation conditions (Figure S1B). This improvement is again explained by estimation bias, which is mitigated by the randomized approach (Figure S1C). Though one might expect the benefit of maximum likelihood would appear for larger sample sizes (e.g., due to the asymptotic efficiency of maximum likelihood estimation in theory), this is not the case. In fact, due to the bias of maximum likelihood, the relative improvement of the randomized approach is larger at N = 1,000 cells (Figure S1D). Moreover, Figure S1E demonstrates that the maximum likelihood approach does not empirically provide more accurate estimates even under correct specification (with fluorescences sampled as in Figure 2a).

4.7. Maximum likelihood estimation

To estimate $μ_{F},$ likelihood‐based approaches consider the counts in each of K bins (Y ₁, Y ₂, …, Y _K), since the measured fluorescence values cannot be disambiguated when multiple populations are sorted in multiplex. These counts follow a multinomial distribution as

$Y_{1}, Y_{2}, \dots, Y_{K} \sim Mult (π (μ, σ^{2}), N),$

where $N = \sum_{k = 1}^{K} Y_{k}$ is the total number of cells sorted into any bin and $π (μ, σ^{2}) = (π_{1}, π_{2}, \dots, π_{K})$ are the normalized bin probabilities. In particular, if for each bin k the range of fluorescences collected is from lower bound l _k to upper bound u _k, then

$π_{k} = \frac{Φ \left( \frac{u_{k} - μ}{σ} \right) - Φ \left( \frac{l_{k} - μ}{σ} \right)}{\sum_{k' = 1}^{K} \left[ Φ \left( \frac{u_{k'} - μ}{σ} \right) - Φ \left( \frac{l_{k'} - μ}{σ} \right) \right]},$

where Φ(·) is the cumulative distribution function of the standard normal. The log likelihood function is then

$\log p (Y_{1}, \dots, Y_{K}, μ, σ^{2}) = \log N! - \sum_{k = 1}^{K} \log Y_{k}! + \sum_{k = 1}^{K} Y_{k} \log π_{k},$

where the dependence of each π _k on μ and σ ² is left implicit. The maximum likelihood approach is to return μ that maximizes this expression,

$\hat{μ}_{MLE} = \underset{μ}{\arg \max} \left[ \max_{σ^{2} > 0} \log p (Y_{1}, \dots, Y_{K}, μ, σ^{2}) \right] .$

This optimization problem is not analytically tractable, and its constraints and non‐convexity pose challenges for local, gradient‐based optimizers. So we instead solve the optimization approximately with a grid search.
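A grid search of this kind can be sketched in Python using only the standard library. The grids, bin edges, and counts below are hypothetical values chosen for illustration, the function names are assumptions, and the constant terms of the log likelihood that do not depend on μ or σ are dropped because they do not affect the maximizer. This is not the implementation used for Figure S1.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_likelihood(counts, edges, mu, sigma):
    """Multinomial log likelihood of bin counts, up to terms constant in (mu, sigma)."""
    masses = [
        norm_cdf((edges[k + 1] - mu) / sigma) - norm_cdf((edges[k] - mu) / sigma)
        for k in range(len(counts))
    ]
    total = sum(masses)  # normalizer over all bins
    ll = 0.0
    for y, mass in zip(counts, masses):
        if y > 0:
            if mass <= 0.0:
                return -math.inf  # observed cells in a bin with zero probability
            ll += y * math.log(mass / total)
    return ll


def grid_search_mle(counts, edges, mu_grid, sigma_grid):
    """Return the (mu, sigma) grid point with the highest log likelihood."""
    best_ll, best_mu, best_sigma = -math.inf, None, None
    for mu in mu_grid:
        for sigma in sigma_grid:
            ll = log_likelihood(counts, edges, mu, sigma)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma


edges = [0.0, 0.25, 0.5, 0.75, 1.0]          # four equal-width bins (assumed)
counts = [10, 115, 115, 10]                  # hypothetical symmetric counts, N = 250
mu_grid = [i / 100 for i in range(101)]      # candidate means on [0, 1]
sigma_grid = [0.02 * j for j in range(1, 26)]  # candidate standard deviations
mu_mle, sigma_mle = grid_search_mle(counts, edges, mu_grid, sigma_grid)
print(f"mu_MLE = {mu_mle:.2f}, sigma at optimum = {sigma_mle:.2f}")
```

Searching jointly over the two grids and reading off μ is equivalent to the nested maximization in the expression above. The resolution of the answer is limited by the grid spacing, so finer grids trade computation for precision.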

AUTHOR CONTRIBUTIONS

Brian L. Trippe: Conceptualization (equal); formal analysis (equal); funding acquisition (equal); investigation (equal); methodology (equal); software (equal); validation (equal); visualization (equal); writing – original draft (equal); writing – review and editing (equal). Buwei Huang: Data curation (equal); formal analysis (equal); investigation (equal); validation (equal); writing – original draft (equal); writing – review and editing (equal). Erika A. DeBenedictis: Data curation (equal); formal analysis (equal); investigation (equal); validation (equal); writing – original draft (equal); writing – review and editing (equal). Brian Coventry: Data curation (equal); formal analysis (equal); investigation (equal); validation (equal); writing – original draft (equal); writing – review and editing (equal). Nicholas Bhattacharya: Formal analysis (equal); investigation (equal); methodology (equal); writing – original draft (equal); writing – review and editing (equal). Kevin K. Yang: Conceptualization (equal); formal analysis (equal); investigation (equal); methodology (equal); project administration (equal); software (equal); supervision (equal); validation (equal); writing – original draft (equal); writing – review and editing (equal). David Baker: Data curation (equal); investigation (equal); project administration (equal); resources (equal); supervision (equal); validation (equal); writing – original draft (equal); writing – review and editing (equal). Lorin Crawford: Conceptualization (equal); funding acquisition (equal); investigation (equal); methodology (equal); project administration (equal); supervision (lead); visualization (equal); writing – original draft (equal); writing – review and editing (equal).

Supporting information

Appendix S1 Supporting Information

Figure S1. Maximum likelihood estimation of mean fluorescence assuming log normality is non‐robust on real data. (A) Fluorescence values are drawn independently from empirical distributions of binding signal of mini‐protein binders to ActRII obtained by flow cytometry and rescaled to have a range of different standard deviations (0.01 in red, 0.05 in green, and 0.20 in blue, respectively). (B) The ratio of the MSE of estimates from the maximum likelihood approach on histogram counts to unbiased estimates from randomized gates. (C) The estimation bias for the maximum likelihood estimates with the histogram counts (left) versus the randomized gates (right). All points in panels (B) and (C) are the average across 200 replicates, each with N = 250 cells. (D) The ratio of the MSE of the estimates is not substantially different even with a greater number (N = 1,000) of cells or (E) under correct specification with log normal distributed observations.

Figure S2. The error of estimates with randomized gates decreases with increasing sample size. Here, we consider N = 100, 250, and 1,000 total cells, respectively. The bottom row shows the relative performances of estimates from histograms and randomized gates across a range of mean log fluorescences in terms of mean squared error (ratios greater than 1 reflect lower error with randomized gates and ratios below 1 reflect lower error with histograms).

Figure S3. Randomized gates for (A) estimating population variances as defined by Equation (S1) and (B) optimally estimating population means with 4‐way sorting.

Figure S4. Average squared error of standard deviation estimates of various p _F at different sample sizes. Here, we consider (A) N = 1,000, (B) N = 5,000, and (C) N = 25,000 total cells. Depicted are the performances of (left) histogram based estimates and (right) the randomized estimate $\sqrt{{\hat{σ}}^{2}}$ . In panel (D), we show the distributions from which we simulated the data.

ACKNOWLEDGMENTS

We would like to thank Sarah Kate Nyquist for helpful conversations and suggestions. Brian L. Trippe would like to acknowledge support from the National Science Foundation Graduate Research Program. Lorin Crawford is supported by a David & Lucile Packard Fellowship for Science and Engineering. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the funders.

Trippe BL, Huang B, DeBenedictis EA, Coventry B, Bhattacharya N, Yang KK, et al. Randomized gates eliminate bias in sort‐seq assays. Protein Science. 2022;31(9):e4401. 10.1002/pro.4401

Review Editor: Aitziber Cortajarena

Funding information David and Lucile Packard Foundation; National Science Foundation

Contributor Information

Brian L. Trippe, Email: blt2114@columbia.edu.

Lorin Crawford, Email: lcrawford@microsoft.com.

REFERENCES

1. Cao L, Goreshnik I, Coventry B, et al. De novo design of picomolar SARS‐CoV‐2 miniprotein inhibitors. Science. 2020;370(6515):426–431.
2. Cao L, Coventry B, Goreshnik I, et al. Design of protein‐binding proteins from the target structure alone. Nature. 2022;605:551–560.
3. Kinney JB, Murugan A, Callan CG, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci. 2010;107(20):9158–9163.
4. Starr TN, Greaney AJ, Hilton SK, et al. Deep mutational scanning of SARS‐CoV‐2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182(5):1295–1310.
5. Adams RM, Mora T, Walczak AM, Kinney JB. Measuring the sequence‐affinity landscape of antibodies with massively parallel titration curves. eLife. 2016;5:e23156.
6. Reich L, Dutta S, Keating AE. SORTCERY—A high‐throughput method to affinity rank peptide ligands. J Mol Biol. 2015;427(11):2135–2150.
7. Peterman N, Levine E. Sort‐seq under the hood: Implications of design choices on large‐scale characterization of sequence‐function relations. BMC Genomics. 2016;17(1):1–17.
8. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Data mining, inference, and prediction. New York: Springer, 2001.
9. de Boer CG, Ray JP, Hacohen N, Regev A. MAUDE: Inferring expression changes in sorting‐based CRISPR screens. Genome Biol. 2020;21:1–16.
10. Fulco CP, Nasser J, Jones TR, et al. Activity‐by‐contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat Genet. 2019;51(12):1664–1669.
11. Rocklin GJ, Chidyausiku TM, Goreshnik I, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science. 2017;357(6347):168–175.
12. Boder ET, Wittrup KD. Yeast surface display for screening combinatorial polypeptide libraries. Nat Biotechnol. 1997;15(6):553–557.
13. Sharon E, Kalma Y, Sharp A, et al. Inferring gene regulatory logic from high‐throughput measurements of thousands of systematically designed promoters. Nat Biotechnol. 2012;30(6):521–530.
14. Biswas S, Kuznetsov G, Ogden PJ, Conway NJ, Adams RP, Church GM. Toward machine‐guided design of proteins. bioRxiv. 2018;337154.
15. Chevalier A, Silva D‐A, Rocklin GJ, et al. Massively parallel de novo protein design for targeted therapeutics. Nature. 2017;550(7674):74–79.
16. Castillo‐Hair SM, Sexton JT, Landry BP, Olson EJ, Igoshin OA, Tabor JJ. FlowCal: A user‐friendly, open source software tool for automatically converting flow cytometry data from arbitrary to calibrated units. ACS Synth Biol. 2016;5(7):774–780.
17. Luenberger DG. Optimization by vector space methods. New York: John Wiley & Sons, 1969.
