Abstract
Since the 16th century, assays and screens have been essential for scientific investigation. However, most methods could be significantly improved, especially in accuracy and scalability, and many lack adequate comparisons to negative controls. There is also little consistency in distinguishing assays, in which accuracy is the main goal, from screens, in which scalability is prioritized over accuracy. We dissected and modernized the original definitions of assays and screens in light of recent developments and the conceptual framework of the original definitions. All methods have three components: design/measurement, performance, and interpretation. We propose a model of method development in which reproducible observations become new methods, initially assessed by sensitivity. Further development can proceed along a path to either screens or assays. The screen path focuses on scalability first, but can later prioritize analysis of negatives. Alternatively, the assay path first compares results to negative controls, assessing specificity and accuracy, and later adds scalability. Both pathways converge on a high-accuracy and high-throughput (HAT) assay, such as next generation sequencing, which we suggest should be the ultimate goal of all testing methods. Our model will help scientists better select among available methods and improve existing methods, expanding their impact on science.
Keywords: High-throughput, assay, screen, accuracy, method, HAT Assay, fabrication
Graphical Abstract
Assays and screens are types of scientific methods, but their relationship has been unclear. Review of their definitions and development identifies eras and pathways of method development, and factors that distinguish assays from screens. Their merger into high-accuracy and high-throughput (HAT) assays will likely drive a new era of science.

1. Introduction
One of the key attributes that distinguishes science from lay pursuits is the assay. Assays establish a reproducible test to critically and quantitatively compare an unknown sample to established standards and controls. The first record of an assay was the cupellation reaction, which oxidized non-precious metals and was used to measure specific metal content in ores during the 14th–16th centuries. Shortly thereafter, the first published definition of an assay was “to compare the potency of the particular preparation test with that of a standard preparation of the same substance”.[1] The first application to living matter dates to Paul Ehrlich’s bioassay in the 1890s, in which the potency of diphtheria antitoxin serum was standardized against the toxin.[2]
Without the numerous assays developed since the concept was introduced, modern science would not exist in its current form and influence on society. The sophistication of assays, their design, and analysis has gradually matured through the centuries, gaining pace in recent decades with the development of high-throughput Omics methods and the influence of data science. In this paper, we critically assess the attributes of biological assays and a related but distinct term “screen”. We review conceptual advances that have developed since their original definitions and propose updated definitions, as well as a new model with pathways for assay and screen development. We also introduce a new method term called “Fabrication” to describe many methods designed to produce or purify a substance.
2. Assays
In the 14th century, the term “assay” was derived from the Anglo-French assaier, meaning to analyze. One of the first appearances of “assay” in the scientific literature was the use of the cupellation assay to measure silver content in ore in 1677.[3] The original definition of assay (see Box 1) can be deconvolved into three key components: the design/measurement, performance, and interpretation. Part of the design includes the word “standards” in the definition, which refers to what we now also call controls.[4,5]
BOX 1: BIOLOGICAL METHOD DEFINITIONS.
| Method |
| Original definition: a test involving experiments on organisms.[5] |
| New definition: a procedure to experiment on an organism or material derived from an organism, or to produce or purify a substance(s) for such purpose. |
| Screen |
| Original definition: the sort of physical examination of the child which will recognize deviations from the normal, but which will not result in a final diagnosis. |
| New definition: A scalable biological or medical experimental method for surveying a group to enrich for likely candidates that can be confirmed with an assay. |
| Assay |
| Original definition: to compare the potency of the particular preparation test with that of a standard preparation of the same substance. |
| New definition: A biological or medical experimental method to compare samples to known negative and positive standards with high statistical accuracy. |
| Fabrication |
| New definition: A method to create or purify a new reagent. |
| HAT Assay |
| New definition: High-accuracy and high-throughput assay. |
Today, assay design includes speed, scalability, replicate architecture, and the data type of the measured outcome (nominal, ordinal, discrete, continuous, or image). The data type is the scale of how the “potency” is measured in the original definition. Assay performance is now routinely assessed by the statistics of accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Lastly, the term “compare” in the definition refers to the interpretation. Modern assay interpretation further includes assessment of confidence, reproducibility, and comparison to known controls. Examples of these components are later discussed in detail.
3. Screens
The initial language used in assay definitions reflects the intent of accuracy and reproducibility.[4] This is not the case for “screening”, where high rates of false positives and false negatives are tolerated in favor of scaling to analyze many samples. The noun “screen” was derived from the Old French word escran, meaning fire-screen or tester of a bed, in the early 1400s.[5] Although first called a “test” in medical applications, it later became known as a “medical screen” as tuberculosis screening developed between 1891 and 1907, and later with mental health testing for enlistment into the US Army in 1917.[6,7]
The first definition of “screening” (see Box 1) related to its applications in medicine was published in 1925: “The sort of physical examination of the child which will recognize deviations from the normal, but which will not result in a final diagnosis”.[8] This definition implies a less accurate first-line test that requires additional confirmation. Thus, the original goal of screening is to identify candidates, whereas the goal of an assay is to make a final determination. Medical screening of the population was adopted to test blood glucose levels for diabetes in the 1940s [9], to identify a person’s blood group in the 1950s, and for newborn screening for phenylketonuria and karyotype screening in the 1960s; all are now general approaches for population screening.[10]
Shortly after introduction of the screening concept, Thomas Hunt Morgan created forward genetic screens, identifying mutant, white-eyed fruit flies in 1910.[11] However, the method was not called a screen in the original paper.
Later, new screening methods led to many genetic and biological screens and eventually the Omics era. Other examples are reverse genetic screens, yeast 2-hybrid screens, RNAi screens, CRISPR/Cas screens, and lethality screens such as deep mutational scanning.[12–16] Also important is high-throughput screening (HTS), now also called high-content screening (HCS), of compound libraries for drug leads, started by Pfizer in 1986.[17]
4. Fabrication
We collected a short list of biological methods to examine assay and screen classifications (Table 1). We noticed many methods that did not focus on analysis of samples, but rather on creating or purifying a new reagent or using a piece of equipment (new definition in Box 1). Examples of Fabrication methods are oligonucleotide synthesis, peptide synthesis, and purification of a peptide by high-performance liquid chromatography (HPLC). Several methods are also used to purify macromolecules (electrophoresis, chromatography, centrifugation) or to create something, such as engineered mice, engineered mutants, PCR, CRISPR/Cas9, TALEN, oligonucleotide synthesis, and molecular cloning.
TABLE 1.
Example assays and screens with classification and metadata.
| Method | Inception | Type | Assay/screen | Performance | Era |
|---|---|---|---|---|---|
| Metal content in ore | 1677 | Metal content | Assay | Accuracy | Descriptive |
| Good air | 1774 | Oxygen | Assay | Accuracy | Descriptive |
| Chamberland filter | 1884 | Microbes | Assay | Accuracy | Descriptive |
| Phage display | 1985 | Protein interaction | Screen | Sensitivity | Industrial |
| LC MS/MS | 1989 | Metabolomics | Screen | Sensitivity | Industrial |
| NMR | 1946 | Structure | Assay | Accuracy | Industrial |
| Y2H | 1989 | Protein interaction | Screen | Sensitivity | Industrial |
| Peptide array | 1984 | Protein interaction | Screen | Sensitivity | Industrial |
| Microarray | 1983 | Gene expression | Screen | Sensitivity | Industrial |
| High content screen | 1986 | Drug–protein interaction | Screen | Sensitivity | Industrial |
| Affinity MS/MS | 1999 | Protein interaction and posttranslational modification | Screen | Sensitivity | Industrial |
| RNAi library | 1998 | Gene function | Screen | Sensitivity | Industrial |
| ChipSeq | 2007 | Epigenetics | Screen | Sensitivity | Omics |
| RNAseq | 2006 | Gene expression | Assay | Accuracy | Omics |
| DNAseq (NGS) | 2008 | Genome sequencing | Assay | Accuracy | Omics |
| CRISPR | 2014 | Gene function | Screen | Sensitivity | Omics |
| SELEX | 1990 | Gene function | Screen | Sensitivity | Industrial |
| MAVE/lethality | 2018 | Gene function | Screen | Sensitivity | Omics |
| GigaAssay | 2022 | Gene function | Assay | Accuracy | Omics |
| DNA encoded library | 1992 | Drug–protein interaction | Screen | Sensitivity | Industrial |
5. Eras of method development
Historical ordering of the methods in Table 1 reveals patterns that lead us to propose three eras of method development (Fig. 1). The “Descriptive” era focused on simple one-step methods that produced observational results, such as a popping sound when assessing production of “good air” and “bad air” by photosynthesis.[18,19] This era was limited by the general lack of available reagents and instrumentation, and lasted until the early 1900s.
Figure 1. Eras of method development and essential components of screens and assays.
Key attributes for each component are listed.
During the 20th century, the “Industrial” era saw a shift away from simple one-step descriptive methods to multistep assays and screens. As biology and biotechnology developed, fabrication methods matured, producing new reagents (sometimes called “kit science”) and instrumentation enabled by the development of electronics. Better approaches were also developed to assess method efficiency, performance, and interpretation. Assays and screens improved in accuracy and throughput, respectively.
During the past 35 years, the current “Omics” era emerged. This era saw maturation of many methods, with more sophisticated assays and screens featuring either serial or parallel architectures and the high scalability needed to address increasingly complex questions. Higher-power computers and new analytics and algorithms enabled interpretation of the datasets produced by high-throughput methods.
We are witnessing a growing impact of data science on the analysis of method performance and interpretation of the larger datasets. We expect the next era is likely to be the HAT era, where many existing and new omics technologies will advance toward HAT assays with both high throughput and high accuracy.
6. Revisiting method definitions
Assays, screens, and our proposed Fabrication method are the general types of biological methods. We next examined the 3 main components of these methods: design/measurement, performance, and interpretation (Fig. 1).[20]
6.1. Design/measurement.
The design/measurement component has several considerations, including the data type and the method architecture. The main data types are categorical or numerical (see Box 2). Categorical data can be nominal, with mutually exclusive, unordered labels (e.g., dead or alive in lethality selection; binds or does not bind in phage display), or ordinal, with a ranked order of the labels for a variable (e.g., stage I–IV cancer in histology; high, medium, and low expression bins in flow cytometry). Numerical data can be interval or ratio. Interval data have distinct, evenly spaced values (e.g., the Celsius temperature scale). Ratio data are similar to interval data but have an absolute zero as the point of origin (e.g., number of colonies in antibiotic resistance; the Kelvin temperature scale). Both types of numerical data can be either discrete or continuous. The data types are important for testing performance and interpretation.
BOX 2: DATA TYPES.
| Categorical |
| Nominal |
| Ordinal |
| Numerical |
| Interval |
| Ratio |
The architecture of a method includes its replicate structure, its scalability with dropouts, and, for complex methods, the connection of two or more methods in series into several steps for a particular test. This more sophisticated architecture has become more common during the Omics era. The architecture is often designed to increase method performance and improve interpretation. Controls are important at the earliest stage of method development, establishing an example that reliably reproduces a signal (positive control) and one that does not produce a signal (negative control). Controls are essential for evaluating performance and for interpretation of unknown samples. The performance of the method drives the types and number of controls. Controls can also be used to evaluate variance and establish the dynamic range of an assay or screen.
The replicate structure is the design of how samples are analyzed in parallel. It identifies variance, is necessary for the statistics used in performance and interpretation, and is therefore critical for improving methods over time. An example is nuclear magnetic resonance (NMR) spectroscopy, where the signal is weak and 128–1,024 free induction decay (FID) scans are averaged to increase the signal-to-noise ratio.[21]
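The benefit of replicate averaging can be sketched numerically. This is an illustrative simulation, not a model of NMR acquisition; the signal level, noise level, and scan counts are invented, and the point is simply that averaging n replicate scans shrinks the noise by roughly the square root of n:

```python
import random

def averaged_snr(true_signal, noise_sd, n_scans, n_points=1000, seed=0):
    """Estimate signal/noise after averaging n_scans noisy acquisitions.

    Each scan measures the same true_signal plus Gaussian noise; averaging
    n scans shrinks the noise roughly by a factor of sqrt(n).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_points):
        readings = [true_signal + rng.gauss(0, noise_sd) for _ in range(n_scans)]
        means.append(sum(readings) / n_scans)
    # Empirical noise = standard deviation of the averaged readings.
    grand_mean = sum(means) / n_points
    noise = (sum((m - grand_mean) ** 2 for m in means) / (n_points - 1)) ** 0.5
    return true_signal / noise

snr_1 = averaged_snr(1.0, 2.0, n_scans=1)      # single scan
snr_128 = averaged_snr(1.0, 2.0, n_scans=128)  # replicate averaging
```

With 128 scans, the signal-to-noise ratio improves by roughly a factor of 11 (√128), which is why FID scans are accumulated rather than acquired once.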
An important part of method design is scalability, particularly for screens and high-throughput assays. Scalability reflects how fast and how multiplexed a method can be performed. Although manual scaling was rooted in the descriptive era with the genetics studies of Gregor Mendel, decades later medical screening became the primary driver of method scaling.[22] The Industrial era was essential for scaling many methods and was foundational for the Omics era. Scalability is a combination of increased throughput balanced against dropout: the measurements that must be filtered and discarded and are not used in the final interpretation of the results. Next generation sequencing (NGS) is an excellent example, where high filtering and dropout are acceptable because of the vast increase in throughput.
There is a current trend in the Omics era toward connecting several methods in series, creating more complex methods (e.g., Fig. 1, [15,23,24]). Many of these methods are very complex, having multiple data types and complex relationships among the other aspects of method architecture. Qualitative data such as images, videos, concept maps, and graphs with nodes and edges are complex data types that can be represented as combinations of the basic types. Such methods can be evaluated as a whole, but each step can also be evaluated independently. Other relationships appear in the newer series architectures, such as the cardinality between method steps. For example, in the GigaAssay, some serial steps leading up to unique molecular identifier (UMI) barcoding have a 1:1 cardinality, whereas the step barcoding each cDNA molecule carrying the same mutant has a 1:many cardinality, which is important for later reducing false positive errors.[23]
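How a 1:many cardinality reduces false positives can be illustrated with a small sketch. The data and cutoffs below are invented for illustration and do not reproduce the GigaAssay pipeline; the idea is that when one UMI tags a single molecule read many times, a sequencing error appears in a minority of that UMI's reads and is outvoted by the consensus:

```python
from collections import Counter, defaultdict

def umi_consensus(reads, min_reads=3, majority=0.8):
    """Collapse (umi, base_call) read pairs to one consensus call per UMI.

    A UMI with too few reads is dropped; otherwise the majority call wins
    only if it reaches the required fraction, filtering sequencing errors.
    """
    by_umi = defaultdict(list)
    for umi, call in reads:
        by_umi[umi].append(call)
    consensus = {}
    for umi, calls in by_umi.items():
        if len(calls) < min_reads:
            continue  # too few reads to trust this UMI
        call, count = Counter(calls).most_common(1)[0]
        if count / len(calls) >= majority:
            consensus[umi] = call
    return consensus

# Hypothetical reads: UMI "AAT" carries one sequencing error out of five.
reads = [("AAT", "G")] * 4 + [("AAT", "T")] + [("CGA", "C")] * 3 + [("TTG", "A")]
calls = umi_consensus(reads)
```

The lone erroneous "T" read for UMI "AAT" is outvoted, and the singleton UMI "TTG" is discarded entirely rather than risked as a false positive.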
6.2. Performance.
The second major component is testing the performance of the method to assess its quality, reliability, and reproducibility. The performance of Fabrication methods is generally assessed by metrics such as percent purity and yield. Performance for assays and screens is the evaluation of knowns, controls, standards, or benchmark datasets.
The earliest scientific methods focused on classification and categorical data types, reflecting their origins in the descriptive science era. Performance testing of sensitivity and specificity started with the development of serology in the early 1900s.[25] Sensitivity compares the classification of the positive controls into true positives (TP) and false negatives (FN) (see Box 3).[26,27] Related to sensitivity, PPV compares TP to all samples that passed the filter, being TP and false positives (FP). Specificity compares the classification of the negative controls into true negatives (TN) and FP. Related to specificity, NPV compares TN to all samples removed by the filter, being TN and FN. Accuracy is a combined measure of sensitivity and specificity. The filter can be any method, including computational methods that segregate and label data.
BOX 3: METRICS AND FORMULAS.
| Sensitivity = TP / (TP+FN) |
| Specificity = TN / (TN+FP) |
| Accuracy = (TP +TN) / (P+N) |
| PPV = TP / (TP+FP) |
| NPV = TN / (TN+FN) |
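The Box 3 formulas translate directly into code. A minimal sketch, with purely illustrative confusion-matrix counts:

```python
def performance(tp, tn, fp, fn):
    """Compute the Box 3 performance metrics from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),          # TP / (TP + FN)
        "specificity": tn / (tn + fp),          # TN / (TN + FP)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # (TP + TN) / (P + N)
        "ppv": tp / (tp + fp),                  # TP / (TP + FP)
        "npv": tn / (tn + fn),                  # TN / (TN + FN)
    }

# Illustrative counts only: 100 positive and 100 negative controls.
m = performance(tp=90, tn=80, fp=20, fn=10)
```

Note that all five metrics require known negatives (TN and FN); a method that never measures negatives can report sensitivity at best, which is the distinction drawn between screens and assays throughout this paper.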
In addition to these metrics, receiver operating characteristic (ROC) curves are commonly used to plot how variables affect the relationship between sensitivity and specificity.[28] These analyses are appropriate for assays, but not for screens, where TN are unknown and/or negatives and specificity are not the intended focus.
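An ROC curve is built by sweeping a classification threshold across the measured scores of known controls. The scores and labels below are hypothetical; the sketch shows why ROC analysis requires known negatives:

```python
def roc_points(scores, labels):
    """Sweep the threshold over every observed score and return
    (false-positive rate, true-positive rate) pairs, i.e. the ROC curve.

    labels: 1 for known positive controls, 0 for known negative controls.
    """
    points = []
    for thresh in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < thresh and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thresh and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < thresh and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical assay readouts for eight control samples.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,    1,   0,   1,   1,   1]
curve = roc_points(scores, labels)
```

Because every point requires FP and TN counts, the curve is undefined for a screen that never measures its negatives.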
Hypothesis testing is a standard approach for assessing performance with numerical data types. A statistical model is used with hypothesis testing to produce a p value that conveys a confidence level. Thus, a scientist can select a confidence level at which to apply the classification labels of “significant” or “not significant”; p < 0.05 is a well-accepted threshold in scientific investigation. The p value is the probability of observing data at least as extreme as the measured data when the null hypothesis is true; when the null hypothesis is rejected at a threshold α, α is the long-run chance of incorrectly rejecting a true null hypothesis.
With numerical data, performance can be assessed by hypothesis testing each sample, which allows performance to be summarized by sensitivity and specificity. Altering the threshold changes the ratio of TN, FN, FP, and TP (Fig. 2), and the threshold can be optimized to maximize accuracy. When numerical data are gathered, they should be analyzed with a statistical model rather than converted to categorical data. Classification of numerical data imposes an interpretation of the raw data before performance analysis that is unnecessary and subject to information loss and bias.
Figure 2. How thresholds affect performance metrics.
A. A hypothetical experiment using a threshold to optimize the accuracy (89%). B. A hypothetical experiment using a threshold typical of a screen, where the experiment focuses on capturing positives but also captures a significant portion of negatives, labeled FP. C. A hypothetical experiment where the dynamic range of the response variable is increased to separate positives and negatives.
A recent trend toward multistep serial assays creates new complications in performance analysis, and a consensus approach should emerge as more of these methods are created. There is an opportunity to build pipelines from steps that are previously validated assays and screens. Although the statistical models for such methods can be very complex, each step can sometimes be independently assessed for performance, and the overall method can be tested as well. Another noteworthy aspect of performance assessment is the likely intrinsic bias of both journals and scientific teams against publishing papers that report relatively low accuracy. This social aspect of science can hinder continued development of a method. Measuring performance is essential for continued optimization, because reliable measurements are the first step toward measuring the outcome of changing a variable. For this reason, we expect bioinformatics and data science to become increasingly important for method testing and development.
6.3. Interpretation.
The third major component of methods is interpretation: comparing measurements of unknown samples to controls to draw conclusions with an associated confidence. During the descriptive era, conclusions were purely observational and were not based on statistical models.
Interpretation advanced in the industrial era through the introduction of statistical models. Interpretation can be done in many ways and is largely reliant on the hypotheses being tested. Ultimately, interpretation allows information to be extracted from the data generated to test the hypothesis. This is performed with a statistical model, returning a p value that scientists use to draw a conclusion. A threshold of p < 0.05 is commonplace; however, this threshold is arbitrary, and reviewing it may improve conclusions. In the Omics era, p values are often corrected for false discovery, because large numbers of samples are tested, using multiple testing corrections.[29,30]
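One widely used multiple testing correction is the Benjamini–Hochberg step-up procedure, which controls the false discovery rate rather than the per-test error. A minimal sketch, with invented p values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: return a parallel list of booleans
    marking which hypotheses are rejected at false discovery rate alpha.

    Reject the k smallest p values, where k is the largest rank with
    p_(k) <= (k / m) * alpha over the m tests.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical p values from six parallel tests in a high-throughput method.
p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
flags = benjamini_hochberg(p, alpha=0.05)
```

Note that two tests with p < 0.05 (0.039 and 0.041) survive a naive threshold but not the correction, illustrating why Omics-scale datasets require it.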
In methodology, unknowns are assessed by gathering data and comparing their behavior to that of the known controls. For whatever confidence threshold is chosen, an unknown is classified with the controls with which it is significantly associated. When known controls themselves are tested by a method, they can be classified into TP, TN, FP, and FN categories based on the data collected and the results of the statistical model. The performance of these controls is a major focus of method development; as TP and TN increase, the method becomes more accurate and reliable. Fig. 2 shows how positive and negative controls are related and how accuracy can be optimized to improve the performance of a method.
For numerical data, thresholds can be used to convert measurements to categorical data so that labels can be applied. A better approach is to directly compare the distributions of measurements with statistical models. Other analyses can be applied to numerical data, such as principal component analysis, clustering, linear regression, and Pearson correlation. Despite such data science advances, the selection of p value thresholds, typically p < 0.05, remains essentially an arbitrary choice by the scientist.
6.4. Supporting the trend to avoid thresholding of numerical data.
Historically, it was, and still is, commonplace (although less so) to apply a threshold to an experimental data set of a numerical datatype, thereby converting it to categorical data with labels. For example, in proteomics, a threshold on the number of peptides matching a known protein is used to determine whether that protein was present in the mixture. The utility of the threshold is that it allows dichotomous separation of the data, which produces negatives that can then be assessed for accuracy and other performance statistics.
Fig. 2 shows a hypothetical example in which applying different thresholds to a numerical dataset changes sensitivity, specificity, and accuracy. From this figure it is clear that threshold selection can change the conclusions of an experiment, so how a threshold is selected is very important. There are several approaches. Thresholds can be set by measuring the values of one or more variables with known negative and positive controls. Alternatively, a threshold can be selected to optimize an accuracy metric. Furthermore, experimental parameters can be altered to optimize the threshold and hence the performance of the method. However, when the threshold is selected by the scientist, it is prone to subjectivity and can create bias toward a desired result.
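Selecting a threshold to optimize an accuracy metric can be automated, removing the subjective choice. The control readouts below are invented; the sketch simply sweeps candidate thresholds over the observed scores of known controls and keeps the most accurate one, in the spirit of Fig. 2a:

```python
def best_threshold(scores, labels):
    """Sweep candidate thresholds over the observed scores and return the
    (threshold, accuracy) pair that maximizes accuracy on the controls.

    labels: 1 for positive controls, 0 for negative controls. A sample is
    predicted positive when its score is at or above the threshold.
    """
    best = (None, 0.0)
    for thresh in sorted(set(scores)):
        correct = sum(1 for s, y in zip(scores, labels)
                      if (s >= thresh) == (y == 1))
        acc = correct / len(scores)
        if acc > best[1]:
            best = (thresh, acc)
    return best

# Hypothetical control readouts with good separation of the two groups.
scores = [0.2, 0.3, 0.35, 0.55, 0.6, 0.7, 0.8, 0.4]
labels = [0,   0,   0,    1,    1,   1,   1,   0]
thresh, acc = best_threshold(scores, labels)
```

Because the optimization is driven entirely by the controls, the resulting threshold is reproducible and auditable rather than hand-picked.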
We contend that in many cases selecting a threshold is not necessary, introduces bias, is an artifact left over from industrial era science, and needs to be systemized. We suspect that most, if not all, data can be collected as numerical data, but the tendency is to simplify performance testing and interpretation through classification. Furthermore, information is lost during conversion to categorical data types. For example, color detection is often described with labels such as red, yellow, or blue; however, there is far more information in the numerical data of a spectrum, being frequencies and amplitudes. Lastly, data science methods for assessing performance and interpreting numerical data, such as principal component analysis, clustering algorithms, and statistics for comparing populations of samples, are routine and robust.
The cases where thresholds are likely appropriate are when nature sets the threshold. Simple assays from the descriptive era often have natural thresholds. Examples are: 1. live or dead cells; and 2. gas, liquid, and solid phases, which are natural thresholds of atomic bonds and density. Quantum theory is nature’s way of selecting thresholds. We moved toward the practice of converting data during the industrial era of method development. However, scientists often select thresholds by other means, and in these cases it is likely better to analyze the numerical data directly without conversion to categorical data. A clear example is the current practice of selecting gates during a flow cytometry experiment, which are hand-drawn to encompass cell populations and can be considered an observational clustering by a scientist.
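As a sketch of an objective alternative to hand-drawn gates, a clustering algorithm can partition the numerical data directly. The one-dimensional k-means below is a toy illustration on invented fluorescence intensities, not a flow cytometry gating pipeline:

```python
def kmeans_1d(values, k=2, iters=50):
    """Tiny 1-D k-means: assign each event to the nearest of k cluster
    centers, then move each center to the mean of its cluster.

    This replaces a hand-drawn gate with a reproducible partition.
    """
    # Seed centers by taking evenly spaced values from the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Hypothetical intensities from low- and high-expressing cell populations.
events = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.1]
centers = sorted(kmeans_1d(events, k=2))
```

The two recovered centers sit at the means of the low and high populations; the boundary between them falls out of the data rather than a scientist's eye.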
7. Method development pathways.
Examining the collected information for assays and screens (Table 1) inspired us to propose a general model with pathways for the development of assays and screens (Fig. 3). During the foundational Descriptive era of science, methods emerged from observations of nature where measurements were a classification (nominal data type), often dichotomous (with two labels), or a graded trend (ordinal, interval, or ratio data types; Step 1). Many of the earliest methods, such as Chamberland filters, “good air”, and “spoilage”, are dichotomous examples, reflecting the earliest scientific trends toward observation and classification rather than quantitative measurement. The measurement can also be quantitative or ordered, with scalar or ordinal variable types such as the number of sick people or organisms (Step 2). As scientists demonstrated reproducible observations, these established positive and negative controls (Step 3).
Figure 3. Flow diagram for method developmental pathways.
Each era of method development has a different color. Specific steps in method development are filled yellow circles.
The early methods often had Boolean thresholds set by nature such as ignition of good air that pops in a photosynthesis assay, alive or dead in a survival assay, and number of bacterial colonies in a Chamberland filter eluent.
There were three main advances during the Industrial era: more modern methods with mass-produced instruments and reagents, evaluation of method performance, and less subjective interpretations based on statistical tests. Methods with a dichotomous classification set by a natural threshold were reliable, but for methods developed with other data types, thresholds were initially selected arbitrarily. However, quantitatively assessing method performance provided a metric to test whether different classification thresholds or other optimizations could improve performance.
New methods had three fates: replacement, development into a screen, or development into an assay. A method may not achieve sufficient separation between positives and negatives (Fig. 2a or 2b), may be replaced by better technology, or may not be readily scalable and thus become prone to replacement. For example, the canary in a coal mine is no longer used, having been replaced by carbon monoxide sensors. Most descriptive era methods were replaced.
The second fate of a method is to develop into a screen, in which any data type can be measured. Screens both identify positives and eliminate negatives, but have a threshold set to tolerate false measurements (Fig. 2a). The tradeoff is that screens prioritize scaling to test a larger number of unknowns, and for this reason they have become a mainstay of scientific investigation (Fig. 3, Step 4a). Common types of screens are medical screening, forward and reverse genetic screening, and protein–protein interaction screens such as the yeast 2-hybrid screen.[9,11,12,31]
In some cases, negatives are not measured but inferred. For example, phage display measures the binding of different protein sequences to a target but, until recently, the phage in the unbound fraction were not sequenced, so negatives were not measured. Without negatives, the standard method performance statistics of NPV, specificity, and accuracy cannot be determined. Therefore, without rigorous performance evaluation, such methods should be considered screens, and reporting of NPV, specificity, or accuracy should not be expected.
The third fate of a method is to become an assay (Fig. 3, Step 4b). Assays compare unknowns to both TN and TP, which enables the performance measurements of specificity, NPV, and accuracy (Step 4b). Examples are Western blotting and immunohistochemistry, which require highly specific antibody reagents for accuracy. NMR structure determination has an accuracy of 99.9%, but is a low-throughput assay.[32]
The continued development of a screen takes two different paths during the Omics era. When a screen is initially set up, it is unlikely to have 100% sensitivity, but it can be improved to produce a high-sensitivity screen (Step 5a). Alternatively, a screen can adopt comparison of unknowns to true negatives and convert into a high-throughput assay. An example is phage display, which was initially set up as a screen to identify sequences of proteins or peptides that bind another molecule.[33] The initial screen did not sequence phage from the unbound fraction and did not evaluate true negatives. However, as NGS became cheaper and faster, measurement of negatives became feasible, and phage display became an assay in which accuracy is measured.[34]
During the Omics era, assays also had two paths of development. An assay can be converted to a high-throughput assay by scaling its throughput (Step 5b). Optimizing accuracy can guide improvement of the assay to produce a high-accuracy assay (Step 5c). Either the high-throughput assay or the high-accuracy assay can be further improved, by improving accuracy or adding scalability, respectively, producing a high-throughput and high-accuracy (HAT) assay (Step 6). The high-sensitivity screen produced in Step 5a can also be converted to a high-throughput assay (Step 5b) or a HAT assay (Step 6) by reaching adequate specificity.
We contend that the HAT assay in Step 6 should be the aspirational goal of any assay or screen, as it is the only stage that can measure any of the previous steps. There is no set accuracy threshold for designating a HAT assay; we propose 99% accuracy with analysis of over 1,000 samples as a starting point, but over time these numbers should improve. The scalability of samples should also account for dropout rates, which reduce the final number of samples analyzed. In the biological sciences, we know of only one method that qualifies by these metrics: next generation sequencing (NGS).
As many medical screens are now being converted into assays, a recent analysis of 144 medical screening tests showed that a minor fraction have both high sensitivity and specificity (>99%) and would qualify as HAT assays if they were multiplexed with robotics to analyze thousands of samples.[35] However, the high accuracy is almost always dependent on the clinical sample tested; an example is a saliva polymerase chain reaction assay for cytomegalovirus.[36] We have not done a thorough review, but expect that many medical tests will reach the accuracy needed for a HAT assay. They are still referred to in the literature as screens, even though under our new definitions they would be considered assays.
8. The first HAT assay – NGS
At the start of the Omics era, nucleotide blotting assays were eventually scaled to microarrays measuring the expression levels of tens of thousands of transcripts at once.[37] Affymetrix arrays emerged as the leading microarray technology, a highly parallelized design with multiple oligonucleotides hybridizing to different regions of each transcript. Specificity was an early problem for the technology, with at least 30% false positives;[38] the parallelized assay architecture reduces these false positives. Despite the scalability of the microarray assay, its accuracy never approached the 99% definition of a HAT assay, with one of the highest measurements at 83%.[39] Microarrays were eventually supplanted by RNA-Seq, which offered higher accuracy and increased throughput.[40,41]
In reviewing the major Omics technologies in Table 1, DNA and RNA NGS are the only HAT assays, whereas the other 11 high-throughput technologies are screens. This is likely why NGS has had such a vast influence on modern science.[42,43] Why has this method developed into such a strong and influential assay? In part, the answer is an advancement in assay architecture. Initial scaling of the technique used 4 distinct dideoxynucleotides (ddG, ddA, ddT, and ddC) in separate reactions, creating a new type of assay architecture in which 4 mutually exclusive measurements of each position identify negatives in three of the 4 base reactions and a positive in the remaining one. This architecture effectively rules out the majority of false positives. Combined with throughput scaling over decades, from single samples to multi-lane gels, to fluorescent dideoxynucleotides with 96-lane capillary electrophoresis, to sequencing by synthesis with microfluidics, this assay architecture enabled the most modern form of a HAT assay.
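The mutually exclusive four-base architecture can be sketched in a few lines (our illustration, not production base-calling code; the threshold and signal values are invented). A base is called only when exactly one channel is positive, so the three negative channels rule out most false positives:

```python
# Sketch of a mutually exclusive four-channel base call: one separate
# measurement per dideoxynucleotide channel, with a call made only when
# exactly one channel exceeds the (illustrative) threshold.
THRESHOLD = 0.5

def call_base(channels):
    """channels: dict of signal intensity per base, e.g. {'G': 0.9, 'A': 0.1, ...}"""
    positives = [base for base, signal in channels.items() if signal > THRESHOLD]
    if len(positives) == 1:           # one positive, three internal negatives
        return positives[0]
    return "N"                        # ambiguous or empty: no-call, not a false call

print(call_base({"G": 0.9, "A": 0.1, "T": 0.05, "C": 0.1}))  # G
print(call_base({"G": 0.9, "A": 0.8, "T": 0.05, "C": 0.1}))  # N
```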
9. Bioinformatics and Data Science
The fundamental components of assays and screens rely more and more on bioinformatics, which is driving science toward the modern HAT assay era. While computational biology contributes to measurements and the processing of raw data, bioinformatics has had a far greater impact on performance scalability and accuracy, as well as interpretation.
Science in the descriptive and industrial eras was limited by small sample sizes and non-automated calculations. For example, two protein sequences could be aligned by hand, but with the advent of computational biology, sequence alignment became scalable, first with the Needleman-Wunsch algorithm.[44] The Atlas of Protein Sequence and Structure collected protein sequences, and with this resource and eventually more advanced algorithms, the ability to compare proteins scaled.[45] In fact, modern genomics would be very limited without the Burrows-Wheeler transform.[46] These concepts, applied more broadly, affect the scaling of measurements and throughput.
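A minimal version of that first scalable alignment method fits in a short function. The sketch below computes only the Needleman-Wunsch global alignment score (no traceback), with illustrative scoring values of match +1, mismatch -1, gap -1:

```python
# Minimal Needleman-Wunsch global alignment score via dynamic programming.
# Scoring parameters (match=+1, mismatch=-1, gap=-1) are illustrative choices.
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap            # align prefix of a against all gaps
    for j in range(1, cols):
        dp[0][j] = j * gap            # align prefix of b against all gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

print(nw_score("GATTACA", "GCATGCU"))  # 0
```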
This transition between eras also transforms the confidence in scientific conclusions, now assessed with statistical models and tests. For example, two sequences of high similarity can be tested for an evolutionary relationship. Such experiments often produce statistics in which the probability that the two sequences do not come from a common ancestor is less than 1 in the number of particles in the universe.[47] That is high confidence! Because of this increase in confidence, older measurements of far lower throughput that define "knowns" may not validate with modern high-scale, high-accuracy measurements. Science is redefining knowns with the advent of bioinformatics.
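A toy calculation (our illustration, not the Karlin-Altschul statistics actually used by tools like BLAST) shows why such extreme probabilities arise: under a null model where residues match by chance with small probability, the likelihood of a long, highly similar alignment shrinks super-exponentially with its length.

```python
# Toy binomial null model for sequence similarity: probability of observing
# at least k matching positions out of n if each position matches by chance
# with probability p (0.05 here, roughly one of 20 amino acids).
from math import comb

def p_chance_identity(n, k, p=0.05):
    """P(at least k matches in n positions) under a binomial null model."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A 100-residue alignment with 60% identity arising by chance:
print(p_chance_identity(100, 60))  # astronomically small
```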
Bioinformatics has enabled the use of larger datasets with numerous sophisticated algorithms and has produced many new databases in the Omics era. These datasets enable artificial intelligence (AI) and deep learning, opening many new areas in biology. The resulting increases in scalability and accuracy, together with the redefinition of true measurements, are driving science into the HAT era.
10. CONCLUSIONS AND OUTLOOK
Our critical review of scientific methods reveals several key principles that we include in the first method treatise:
- Methods have components of design, performance, and interpretation
- Assays are distinguished from screens by assessing the behavior of true and false negatives
- Many methods produce or purify a new reagent and do not fit either classification, thus we designated this class as "fabrication" methods
- We recognize three eras of method maturation, named descriptive, industrial, and Omics
- HAT assays have high accuracy (>99%) and throughput (n ≥ 1,000) and should be a developmental goal of all assays and screens
- There are several pathways of method development leading from initial observations ultimately to HAT assays
- Numerical data should be directly analyzed by hypothesis testing, rather than labeling data based on thresholds
This treatise helps to identify opportunities and pathways for further development of biomedical methods. Although NGS is the only biological HAT assay with an error rate below 1%, some techniques are poised for the transition. Flow cytometry, high content screening, and the GigaAssay all meet the throughput qualification, analyzing >1,000 samples, and have accuracies >90%, thus are close to becoming HAT assays.[17,23,48–52] Flow cytometry is a high-throughput approach that separates cells based on a fluorescent signal, in which tens of thousands of single cells can be measured and sorted rapidly; however, even under optimal conditions, the best achievable cell-sorting accuracy is ~92%. The GigaAssay measures the function of ~10,000 genetic mutations in a gene in a one-pot assay. This approach combines NGS and flow cytometry, taking advantage of a unique, highly parallelized assay architecture with UMI barcoding and signal averaging. The accuracy of the GigaAssay is 95%.
A lesson learned from examining the success of these techniques is that design and architecture are essential for producing HAT assays. For example, the GigaAssay uses dozens of independent measurements from UMI-coded cDNAs and cells, such that errors from a lentiviral insertion into a genomic site that silences expression, or an incorrect variant call for a particular UMI from bioinformatic analysis, do not produce false positives. Rather, these anomalies slightly increase the standard deviation and are captured in the statistical model. This architecture effectively eliminates most false positive and false negative results.
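The signal-averaging principle can be sketched as follows (our simplification; the signal values and UMI count are invented). Because each variant is measured through many independent UMI-tagged cells, a single anomalous measurement shifts the mean only slightly and is absorbed into the standard deviation rather than producing a false call:

```python
# Sketch of per-variant signal averaging across independent UMI-tagged
# measurements: one silenced or miscalled UMI barely moves the mean.
from statistics import mean, stdev

def variant_activity(umi_signals):
    """Aggregate per-UMI measurements for one variant into (mean, stdev)."""
    return mean(umi_signals), stdev(umi_signals)

signals = [1.0] * 29 + [0.0]          # 30 UMIs, one silenced outlier
m, s = variant_activity(signals)
print(round(m, 3))  # 0.967, far from a false-negative call of 0
```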
Method development that follows similar design concepts will gradually transition science from the Omics era into the HAT era, producing more and more HAT assays (Fig. 3). As more assays and screens approach HAT status, these methods will provide examples for the continued development of other Omics methods. The HAT era will advance science because it will scale the ability to produce highly reliable measurements. The scarcity of large, accurate datasets is a limitation for training artificial intelligence (AI) models; thus, AI is expected to become more effective during the HAT era. For this reason, we think an important aspect of the treatise presented herein is to make scientists aware of the need to continue developing methods into HAT assays.
Supplementary Material
11. ACKNOWLEDGEMENTS
We thank Drs. Liz Valente and Jerome Rotter for useful discussions. We also acknowledge James Raymond for his help with editing parts of the manuscript. This work was supported by grants from the National Institutes of Health (P20 GM121325), the Governor's Office of Economic Development (1547526), and the Prabhu endowed professorship.
12. ABBREVIATIONS
- HAT assay: high accuracy and throughput assay
- TP: true positive
- FP: false positive
- TN: true negative
- FN: false negative
- PPV: positive predictive value
- NPV: negative predictive value
- NGS: next generation sequencing
- HTS: high throughput screening
- HCS: high content screening
- NMR: nuclear magnetic resonance
- FID: free induction decay
- UMI: unique molecular identifier
- ROC: receiver operator characteristic
- TEV: tobacco etch virus
13. CONFLICT OF INTEREST
The authors declare that there are no conflicts of interest.
14. REFERENCES
1. Finney DJ (1946). Principles of Biological Assay. 9, 46–91.
2. van Noordwijk J (1989). Bioassays in whole animals. Journal of Pharmaceutical and Biomedical Analysis, 7(2), 139–145. 10.1016/0731-7085(89)80077-9
3. Merret C (1667). A Relation of the Tinn-Mines, and Working of Tin in the County of Cornwall. Philosophical Transactions, 12, 949–952.
4. Bliss CI, & Cattell MK (1943). Biological Assay. Annual Review of Physiology, 5(1), 479–539. 10.1146/annurev.ph.05.030143.002403
5. Online Etymology Dictionary. (n.d.). https://www.etymonline.com
6. Koch R (1891). A Further Communication on a Remedy for Tuberculosis: Translated from the Original Article Published in the "Deutsche Medicinische Wochenschrift," and Published as a Special Supplement to the "British Medical Journal" of November 15th. The Indian Medical Gazette, 26(1), 16–20.
7. Daniel TM (2006). The history of tuberculosis. Respiratory Medicine, 100(11), 1862–1870. 10.1016/j.rmed.2006.08.006
8. Champion ME (1925). Should the health examination be a screening or a diagnosis? American Journal of Public Health, 15(12), 1083–1085. 10.2105/AJPH.15.12.1083
9. Morabia A (2004). History of medical screening: From concepts to action. Postgraduate Medical Journal, 80(946), 463–469. 10.1136/pgmj.2003.018226
10. Nowell PC (1962). The minute chromosome (Phl) in chronic granulocytic leukemia. Blut, 8, 65–66. 10.1007/BF01630378
11. Conklin EG (1908). Experimental Zoology. By Thomas Hunt Morgan, Professor of Experimental Zoology in Columbia University. New York, The Macmillan Company. 1907. Science, 27(682), 139–140. 10.1126/science.27.682.139
12. Fields S, & Song O (1989). A novel genetic system to detect protein-protein interactions. Nature, 340(6230), 245–246. 10.1038/340245a0
13. Starita LM, Islam MM, Banerjee T, Adamovich AI, Gullingsrud J, Fields S, Shendure J, & Parvin JD (2018). A Multiplex Homology-Directed DNA Repair Assay Reveals the Impact of More Than 1,000 BRCA1 Missense Substitution Variants on Protein Function. The American Journal of Human Genetics, 103, 498–508. 10.1016/j.ajhg.2018.07.016
14. Morris JC, Wang Z, Drew ME, & Englund PT (2002). Glycolysis modulates trypanosome glycoprotein expression as revealed by an RNAi library. The EMBO Journal, 21(17), 4429–4438.
15. Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelson T, Heckl D, Ebert BL, Root DE, Doench JG, & Zhang F (2014). Genome-scale CRISPR-Cas9 knockout screening in human cells. Science, 343(6166), 84–87. 10.1126/science.1247005
16. Findlay GM, Daza RM, Martin B, Zhang MD, Leith AP, Gasperini M, Janizek JD, Huang X, Starita LM, & Shendure J (2018). Accurate classification of BRCA1 variants with saturation genome editing. Nature, 562(7726), 217–222. 10.1038/s41586-018-0461-z
17. Pereira DA, & Williams JA (2007). Origin and evolution of high throughput screening. British Journal of Pharmacology, 152(1), 53–61. 10.1038/sj.bjp.0707373
18. Priestley J (1774). Experiments and Observations on Different Kinds of Air (Vol. 1). J. Johnson.
19. Ingenhousz J (1779). Experiments upon Vegetables, Discovering Their great Power of purifying the Common Air in the Sun-shine, and of Injuring it in the Shade and at Night. To Which is Joined, A new Method of examining the accurate Degree of Salubrity of the Atmosphere. In A Source Book in Chemistry 1400–1900. Henry Marshall Leicester and Herbert S. Klickstein.
20. Merriam-Webster. (n.d.). https://www.merriam-webster.com/
21. Becker ED (1993). A brief history of nuclear magnetic resonance. Analytical Chemistry, 65(6), 295A–302A. 10.1021/ac00054a716
22. Abbott S, & Fairbanks DJ (2016). Experiments on Plant Hybrids by Gregor Mendel. Genetics, 204(2), 407–422. 10.1534/genetics.116.195198
23. Benjamin R, Giacoletto CJ, FitzHugh ZT, Eames D, Buczek L, Wu X, Newsome J, Han MV, Pearson T, Wei Z, Banerjee A, Brown L, Valente LJ, Shen S, Deng H-W, & Schiller MR (2022). GigaAssay – An adaptable high-throughput saturation mutagenesis assay platform. Genomics, 114(4), 110439. 10.1016/j.ygeno.2022.110439
24. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, & Séraphin B (1999). A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnology, 17(10), 1030–1032. 10.1038/13732
25. Binney N, Hyde C, & Bossuyt PM (2021). On the Origin of Sensitivity and Specificity. Annals of Internal Medicine, 174(3), 401–407. 10.7326/M20-5028
26. Baratloo A, Hosseini M, Negida A, & El Ashal G (2015). Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity. Emergency (Tehran, Iran), 3(2), 48–49.
27. Altman DG, & Bland JM (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ, 308(6943), 1552. 10.1136/bmj.308.6943.1552
28. Zou KH, O'Malley AJ, & Mauri L (2007). Receiver-Operating Characteristic Analysis for Evaluating Diagnostic Tests and Predictive Models. Circulation, 115(5), 654–657. 10.1161/CIRCULATIONAHA.105.594929
29. Hollestein LM, Lo SN, Leonardi-Bee J, Rosset S, Shomron N, Couturier D-L, & Gran S (2021). MULTIPLE ways to correct for MULTIPLE comparisons in MULTIPLE types of studies. British Journal of Dermatology, 185(6), 1081–1083. 10.1111/bjd.20600
30. So H-C, & Sham PC (2011). Multiple testing and power calculations in genetic association studies. Cold Spring Harbor Protocols, 2011(1), pdb.top95. 10.1101/pdb.top95
31. Hardy S, Legagneux V, Audic Y, & Paillard L (2010). Reverse genetics in eukaryotes. Biology of the Cell, 102(10), 561–580. 10.1042/BC20100038
32. Fowler NJ, Sljoka A, & Williamson MP (2020). A method for validating the accuracy of NMR protein structures. Nature Communications, 11(1), 6321. 10.1038/s41467-020-20177-1
33. Smith GP (1985). Filamentous fusion phage: Novel expression vectors that display cloned antigens on the virion surface. Science, 228(4705), 1315–1317.
34. Dias-Neto E, Nunes DN, Giordano RJ, Sun J, Botz GH, Yang K, Setubal JC, Pasqualini R, & Arap W (2009). Next-Generation Phage Display: Integrating and Comparing Available Molecular Tools to Enable Cost-Effective High-Throughput Analysis. PLoS ONE, 4(12), e8338. 10.1371/journal.pone.0008338
35. Maxim LD, Niebo R, & Utell MJ (2014). Screening tests: A review with examples. Inhalation Toxicology, 26(13), 811–828. 10.3109/08958378.2014.955932
36. Boppana SB, Ross SA, Shimamura M, Palmer AL, Ahmed A, Michaels MG, Sánchez PJ, Bernstein DI, Tolan RW, Novak Z, Chowdhury N, Britt WJ, Fowler KB, & National Institute on Deafness and Other Communication Disorders CHIMES Study. (2011). Saliva polymerase-chain-reaction assay for cytomegalovirus screening in newborns. The New England Journal of Medicine, 364(22), 2111–2118. 10.1056/NEJMoa1006561
37. DeRisi JL, Iyer VR, & Brown PO (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338), 680–686. 10.1126/science.278.5338.680
38. Choe SE, Boutros M, Michelson AM, Church GM, & Halfon MS (2005). Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6(2), R16. 10.1186/gb-2005-6-2-r16
39. Sandberg R, & Larsson O (2007). Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics, 8(1), 48. 10.1186/1471-2105-8-48
40. Corchete LA, Rojas EA, Alonso-López D, De Las Rivas J, Gutiérrez NC, & Burguillo FJ (2020). Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Scientific Reports, 10(1), 19737. 10.1038/s41598-020-76881-x
41. SEQC/MAQC-III Consortium. (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32(9), 903–914. 10.1038/nbt.2957
42. Metzker ML (2005). Emerging technologies in DNA sequencing. Genome Research, 15(12), 1767–1776. 10.1101/gr.3770505
43. Salk JJ, Schmitt MW, & Loeb LA (2018). Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nature Reviews Genetics, 19(5), 269–285. 10.1038/nrg.2017.117
44. Needleman SB, & Wunsch CD (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453. 10.1016/0022-2836(70)90057-4
45. Hersh RT, Eck RV, & Dayhoff MO (1967). Atlas of Protein Sequence and Structure, 1966. Systematic Zoology, 16(3), 262. 10.2307/2412074
46. Li H, & Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. 10.1093/bioinformatics/btp324
47. Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. 10.1016/S0022-2836(05)80360-2
48. Savage EC, Vanderheyden AD, Bell AM, Syrbu SI, & Jensen CS (2011). Independent Diagnostic Accuracy of Flow Cytometry Obtained From Fine-Needle Aspirates. American Journal of Clinical Pathology, 135(2), 304–309. 10.1309/AJCPHY69XVJGULKO
49. Li S, & Xia M (2019). Review of high-content screening applications in toxicology. Archives of Toxicology, 93(12), 3387–3396. 10.1007/s00204-019-02593-5
50. Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, & Mayer G (2018). Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Scientific Reports, 8(1), 10950. 10.1038/s41598-018-29325-6
51. Benjamin R, Giacoletto C, FitzHugh Z, Eames D, Buczek L, Wu X, Newsome J, Han M, Pearson T, Wei Z, Banerjee A, Brown L, Valente L, Shen S, Deng H-W, & Schiller M (2022). Data Supporting a saturation mutagenesis assay for Tat-driven transcription with the GigaAssay. Data in Brief, 45, 108641. 10.1016/j.dib.2022.108641
52. Picot J, Guerin CL, Le Van Kim C, & Boulanger CM (2012). Flow cytometry: Retrospective, fundamentals and recent instrumentation. Cytotechnology, 64(2), 109–130. 10.1007/s10616-011-9415-0