Unsupervised selection of optimal single-molecule time series idealization criterion

Argha Bandyopadhyay; Marcel P Goldschen-Ohm

doi:10.1016/j.bpj.2021.08.045

2021 Sep 4;120(20):4472–4483.


PMCID: PMC8553667 PMID: 34487708

Abstract

Single-molecule (SM) approaches have provided valuable mechanistic information on many biophysical systems. As technological advances lead to ever-larger data sets, tools for rapid analysis and identification of molecules exhibiting the behavior of interest are increasingly important. In many cases the underlying mechanism is unknown, making unsupervised techniques desirable. The divisive segmentation and clustering (DISC) algorithm is one such unsupervised method that idealizes noisy SM time series much faster than computationally intensive approaches without sacrificing accuracy. However, DISC relies on a user-selected objective criterion (OC) to guide its estimation of the ideal time series. Here, we explore how different OCs affect DISC’s performance for data typical of SM fluorescence imaging experiments. We find that OCs differing in their penalty for model complexity each optimize DISC’s performance for time series with different properties such as signal/noise and number of sample points. Using a machine learning approach, we generate a decision boundary that allows unsupervised selection of OCs based on the input time series to maximize performance for different types of data. This is particularly relevant for SM fluorescence data sets, which often have signal/noise near the derived decision boundary and include time series of nonuniform length because of stochastic bleaching. Our approach, AutoDISC, allows unsupervised per-molecule optimization of DISC, which will substantially assist in the rapid analysis of high-throughput SM data sets with noisy samples and nonuniform time windows.

Significance

The divisive segmentation and clustering (DISC) algorithm is a computationally efficient and accurate algorithm for idealizing noisy single-molecule time series. Although largely unsupervised, DISC requires user selection of a guiding objective criterion. We show that different criteria are optimal for different types of data. Critically, single-molecule fluorescence data sets typically exhibit stochastic variation in data properties from molecule to molecule. Here, we extend DISC to automate the optimal choice of criterion on a per-molecule basis. This advance, AutoDISC, provides a practical solution for automating analysis of high-throughput single-molecule fluorescence data sets increasingly common because of developments in camera and dye technologies.

Introduction

Recent advances in detector and imaging technologies have enabled increasingly high-throughput single-molecule (SM) data collection (1, 2, 3). For example, faster scientific complementary metal-oxide-semiconductor (sCMOS) cameras with larger chips coupled with photobleach-resistant fluorophores allow simultaneous recording of hundreds of molecules per field of view for extended durations (4, 5, 6). Tools for rapid unsupervised analysis of such large data sets are increasingly critical to avoid becoming the bottleneck for experiment progress (7,8).

Hidden Markov models (HMMs) are a widely successful approach for SM time series analysis (9, 10, 11, 12). However, global analysis of large high-throughput SM data sets with a specific HMM is challenging for data sets with per-molecule variation in behavior or state emission and noise amplitudes. For example, such variation is typical of camera-based imaging modalities that often contain hundreds to thousands of molecules in which only a subset of molecules exhibits the behavior of interest. Furthermore, spatial nonuniformities in the optical pathways and/or illumination give rise to per-molecule variation in signal/noise, which can be exacerbated by motion within steeply varying excitation fields (e.g., total internal reflection fluorescence microscopy) or fluorophore photodynamics (13, 14, 15). Finally, HMMs require postulation of a molecular mechanism (i.e., specification of a model’s states and the allowed transitions between them), which may be unknown a priori (9). Although methods exist for selecting a model from a set of HMMs (16), this poses a heavy computational burden proportional to the size of the test model set. Unsupervised approaches for HMM model selection, e.g., infinite HMMs (17,18) and deep learning neural networks (19, 20, 21), automate the process of model identification but remain computationally expensive or require extensive training data sets before their use. This is not to say that one should not use an HMM if the data allow it. However, efficient unsupervised approaches that do not require postulation of a specific mechanism or pretraining on similar known data sets allow rapid screening and/or analysis of high-throughput data sets even in the presence of per-molecule variability as discussed above. 
Even in cases in which analysis with an HMM is ultimately desired, rapid screening with unsupervised approaches can aid in guiding the choice of experimental conditions or identifying subsets of molecules with particular behaviors of interest.

The divisive segmentation and clustering (DISC) algorithm is a largely unsupervised top-down approach for rapid idealization of noisy piecewise continuous time series typical of SM imaging experiments (13). For a given noisy time series, DISC estimates the underlying ideal noiseless time series consisting of discrete jumps between a finite number of distinct intensity levels. Both the jumps and the number of distinct intensity levels are determined in an unsupervised fashion and do not require the user to postulate a molecular mechanism before analysis. Compared with HMMs or change point analyses, DISC is orders of magnitude faster while maintaining state-of-the-art accuracy, precision, and recall. Lately, many deep learning techniques reliant on neural networks have been developed for unsupervised SM analysis (19,20). Unlike these approaches, DISC does not require extensive training data sets to guide its idealization, which simplifies its application to multiple different experimental regimes. Rather than relying on training data, DISC utilizes a user-specified objective criterion (OC) that weighs goodness of fit against the complexity of the ideal sequence to guide idealization (22). The OC represents a metric for unsupervised approaches to identify the simplest model that describes the noisy experimental data reasonably well while avoiding complex models that overfit the data. In addition to DISC, OCs have been widely applied in numerous SM analysis approaches, including HMM model selection and change point idealization (10,11,16,23).

Here, we show that different OCs optimize DISC’s accuracy, precision, and recall for SM time series that differ in their signal/noise ratio (SNR) or number of sample points. This is crucial for SM fluorescence imaging experiments in which nonuniformity in the optics or illumination and stochastic bleaching of fluorophores result in variable SNRs and observation window durations across molecules even within a single field of view. To maximize the performance of DISC on such a data set, we use a machine learning technique to automate the per-molecule selection of the optimal OC. Critically, this automation makes DISC robust to both the scale and heterogeneity of data sets typical of increasingly common high-throughput SM imaging experiments (24,25).

Materials and Methods

Simulations

All simulations and analyses were conducted in MATLAB version R2019a (The MathWorks, Natick, MA). SM time series were simulated as Markov chains of dwells in distinct molecular states governed by the average rate of transitions between states (11). The simulated Markov mechanisms are depicted in Fig. S1. Each state was assigned a mean observable intensity. For models of two-state dynamics at one, two, or four sites, state intensity levels were 0 and 1. For the three-state cyclic and linear models, state intensity levels were 0.2, 0.6, and 0.8 to reflect typical SM fluorescence resonance energy transfer (smFRET) observations. All simulations had a uniform sample frequency f_s, and transition rates between states were specified relative to the sample frequency. For the one-site, two-site, and four-site models, both forward and backward transition rates were set to the same value, which ranged from 0.001 to 0.1 f_s. For the three-state models, the fast rate k_f ranged from 0.001 to 0.1 f_s, and the slow rate k_s was set to 0.3 k_f. The fast rate always described transitions between the higher-intensity levels. These prescriptions provide simulations that test performance on both equivalent and disparate rates within a given model.

The duration of each distinct dwell in a state was simulated with double precision. For discretized simulated intensities at a uniform sample frequency f_s, we assigned the weighted mean of the intensities for all dwells within a sample duration (1/f_s) with weights set to the relative fraction of the sample duration for each dwell. This procedure simulates integration of the signal throughout the sample period analogous to camera-based imaging strategies. This results in some samples having simulated mean intensities that are a weighted mean of the intensities of multiple individual states visited during that sample duration. For example, dwells shorter than a single sample period result in truncated observed intensity values for that sample. However, even at the fastest tested transition rate (0.1 f_s), such truncations were infrequent and had little impact on idealization with DISC. At slower transition rates, these subsample events were relatively rare. We note that at faster rates approaching the sample frequency, subsample events are frequent and result in an overall reduction in the apparent state intensity level separation (data not shown). DISC is likely to be inappropriate for such data, as it requires reasonable resolution of the state intensity levels.
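The integration step described above can be sketched as follows. This is a minimal Python illustration rather than the authors' MATLAB code; the function name `integrate_dwells`, the `(duration, intensity)` input format, and the assumption that the dwells cover the whole observation window are all hypothetical choices made for the example.

```python
def integrate_dwells(dwells, n_samples, sample_period=1.0):
    """Time-weighted mean intensity per sample.

    dwells: list of (duration, intensity) tuples in time order
            (hypothetical input format; durations in the same units
            as sample_period, which equals 1/f_s).
    """
    samples = [0.0] * n_samples
    t = 0.0
    for duration, intensity in dwells:
        start, end = t, t + duration
        t = end
        first = int(start // sample_period)
        last = min(int(end // sample_period), n_samples - 1)
        for k in range(first, last + 1):
            # Overlap between this dwell and sample k
            lo = max(start, k * sample_period)
            hi = min(end, (k + 1) * sample_period)
            if hi > lo:
                # Weight = fraction of the sample period spent in this dwell
                samples[k] += intensity * (hi - lo) / sample_period
    return samples


# A 0.25-sample visit to the high state yields an intermediate intensity
print(integrate_dwells([(1.5, 0.0), (0.25, 1.0), (1.25, 0.0)], 3))
```

In this example the brief high-intensity dwell is shorter than one sample period, so the affected sample reports only a fraction of the true level separation, which is the truncation effect discussed in the text.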

When applicable, we implemented state intensity heterogeneity on a per-event basis by adding a small stochastic offset to the mean intensity of individual events. The offset was drawn from an exponential distribution with a mean set to 4% of the average separation between neighboring state intensity levels, similar to prior observations of such fluctuations in measures of molecular association (13).

To simulate noisy experimental data, Gaussian or Poisson noise was added to the noiseless intensity series described above. For Gaussian noise, the standard deviation (SD) of the added noise (σ) was set to the ratio of the average separation between neighboring state intensity levels (ΔI_avg) and the specified per-site SNR (SNR_S ranging from 1 to 8) such that SNR_S = ΔI_avg/σ. For example, ΔI_avg = 1 and σ = 1/SNR_S for two-state dynamics between intensity levels 0 and 1, whereas ΔI_avg = 0.3 and σ = 0.3/SNR_S for three-state dynamics between intensity levels 0.2, 0.6, and 0.8.
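As an illustration of the Gaussian noise prescription, the sketch below derives σ from the level spacing and the per-site SNR and adds uniform noise to an ideal trace. It is a Python sketch under stated assumptions, not the original MATLAB implementation; `add_gaussian_noise` and its arguments are hypothetical names, and the per-event variance scaling used for multi-site models is not included.

```python
import random


def add_gaussian_noise(ideal, levels, snr_s, rng=None):
    """Add uniform Gaussian noise with sigma = delta_I_avg / SNR_S.

    ideal:  noiseless intensity series
    levels: the model's state intensity levels
    snr_s:  per-site signal/noise ratio
    """
    rng = rng or random.Random()
    ordered = sorted(levels)
    # Average separation between neighboring intensity levels
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    delta_avg = sum(gaps) / len(gaps)
    sigma = delta_avg / snr_s
    noisy = [x + rng.gauss(0.0, sigma) for x in ideal]
    return noisy, sigma


# Three-state levels 0.2, 0.6, 0.8 give delta_I_avg = 0.3
_, sigma = add_gaussian_noise([0.2, 0.6, 0.8], [0.2, 0.6, 0.8], snr_s=3)
print(round(sigma, 6))
```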

For simulations of two-state dynamics at two or four independent sites, the variance of the added noise σ² was scaled for each event by the number of sites in their higher-intensity state. The reason for this procedure is to simulate experimental observations such as binding and unbinding of a fluorescent molecule at multiple sites where the noise is observed to increase with the number of fluorophores in each diffraction-limited spot (13). The degree to which multiple sites give rise to additive noise depends on the amount of noise arising from the recording system versus the molecular activity, which will vary for each individual experimental setup. For Poisson-distributed noise, the variance naturally increases with increasing photon counts. In the low-photon regime discussed below, the variance approximately doubles upon transitioning from one to two sites adopting their high-intensity states, whereas this increase is ∼1.5-fold in the higher-photon regime explored here. As there is no single value that will describe additive noise in all experiments, we chose to scale the noise by the number of “active” or “occupied” sites to provide examples of data that differ substantially from the simulations of all single-site models with uniform noise. Given that our results are qualitatively similar for both uniform and scaled noise, it is likely that our conclusions will remain relevant for scaled noise that falls between these extremes. Finally, we also provide simulations with Poisson noise (see below), which together with uniform and scaled noise models provide a breadth of examples across variable types of simulated noise.

For Poisson-distributed noise, we first converted simulated noiseless intensity traces as described above to number of photons per sample. We assumed a baseline mean photon count p per sample duration which defines a baseline SD of σ_p = $\sqrt{p}$ . Simulated noiseless intensity traces for all tested models were scaled by the factor SNR_Sσ_p for a specified per-site SNR_S (ranging from 1 to 8) such that SNR_S = ΔI_avg/σ_p after scaling. Finally, the scaled trace was rounded to an integer number of photons, and the baseline photon count p was added. Poisson noise was then applied to each sample in the simulated photon series by drawing from a Poisson distribution with a mean set to the noiseless simulated photon count at that sample. To simulate data from experiments with lower and higher background photon counts, p was chosen to be either 10 or 50, respectively.
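The photon-count conversion can be sketched in Python as below. The scale factor SNR_S·σ_p is applied exactly as stated in the text, so for levels 0 and 1 the level separation becomes SNR_S·σ_p photons. The function names are hypothetical, and because the Python standard library has no Poisson sampler, Knuth's multiplication method is swapped in here as an assumption; it is adequate only for the moderate means that arise in these simulations.

```python
import math
import random


def to_photon_counts(ideal, snr_s, baseline):
    """Scale a noiseless trace to photons per sample and add the baseline p."""
    sigma_p = math.sqrt(baseline)      # baseline SD, sqrt(p)
    scale = snr_s * sigma_p            # factor stated in the text
    return [round(x * scale) + baseline for x in ideal]


def poisson_draw(mean, rng):
    """Knuth's method (an assumed stand-in sampler; fine for means up to a few hundred)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def add_poisson_noise(ideal, snr_s, baseline, rng=None):
    rng = rng or random.Random()
    photons = to_photon_counts(ideal, snr_s, baseline)
    # Each sample is drawn with mean equal to its noiseless photon count
    return [poisson_draw(m, rng) for m in photons]


print(to_photon_counts([0, 1, 0], snr_s=4, baseline=10))
```

With a baseline of p = 10 and SNR_S = 4, the high state sits about 13 photons above the baseline, which illustrates why the low-photon regime has comparatively large noise in the higher-intensity states.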

A consequence of nonuniform noise in simulations of multiple sites is that SNR is not constant throughout the series but decreases transiently from that specified for each single site during periods in which multiple sites simultaneously adopt their higher-intensity state. SNR is also not constant for the three-state models with uniform noise given that it is based on an average of two different level separations. To obtain a single metric to describe the overall SNR throughout a time series, we define the effective SNR of each series as the ratio of ΔI_avg to the SD of the residuals after subtracting the simulated ideal noiseless intensity series from that after adding noise. Note that this effective SNR can vary even for series simulated with the same per-site SNR_S and will, in general, be less than SNR_S for simulations of multiple sites because of additive noise.
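The effective SNR defined above reduces to a short calculation, sketched here in Python. The function name is hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption because the text does not specify which estimator was used.

```python
import math
import statistics


def effective_snr(noisy, ideal, delta_avg):
    """Ratio of delta_I_avg to the SD of the residuals (noisy minus ideal)."""
    residuals = [n - i for n, i in zip(noisy, ideal)]
    return delta_avg / statistics.stdev(residuals)  # sample SD (assumption)


ideal = [0.0, 0.0, 1.0, 1.0]
noisy = [0.5, -0.5, 1.5, 0.5]
print(effective_snr(noisy, ideal, delta_avg=1.0))
```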

For each model and each unique combination of SNR_S and transition rate (k or k_f), we simulated a total of 100,000 sample points split between 10 and 1000 time series depending on series lengths that ranged from 10,000 samples to 100 samples, respectively.

Idealization performance

Noisy simulated time series were idealized with DISC using one of the following OCs: Bayesian information criterion based on either the residual sum of squared errors (BIC_RSS) or a Gaussian mixture model (BIC_GMM), Akaike information criterion based on either the residual sum of squared errors (AIC_RSS) or a Gaussian mixture model (AIC_GMM), Hannan-Quinn information criterion (HQC_GMM), and minimal description length (MDL) (Eqs. 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11) (23,26, 27, 28, 29). In principle, a different OC could be selected for the segmentation and clustering portions of the DISC algorithm. Here, we only explore the impact of selecting a single OC for all portions of the algorithm. The general format for each OC is a summation of two terms that balance goodness of fit with a penalty for overfitting noise with overly complex models (Eq. 1).

\mathrm{OC} = \text{fit error} + \text{overfit penalty},

(1)

\mathrm{BIC}_{\mathrm{RSS}} = n_{\mathrm{pts}} \ln\left(\frac{\mathrm{RSS}}{n_{\mathrm{pts}}}\right) + (n_{\mathrm{transitions}} + n_{\mathrm{states}}) \ln(n_{\mathrm{pts}}),

(2)

\mathrm{BIC}_{\mathrm{GMM}} = -2 \ln(L) + (3 n_{\mathrm{states}} - 1) \ln(n_{\mathrm{pts}}),

(3)

\mathrm{AIC}_{\mathrm{RSS}} = n_{\mathrm{pts}} \ln\left(\frac{\mathrm{RSS}}{n_{\mathrm{pts}}}\right) + 2 (n_{\mathrm{transitions}} + n_{\mathrm{states}}),

(4)

\mathrm{AIC}_{\mathrm{GMM}} = -2 \ln(L) + 2 (3 n_{\mathrm{states}} - 1),

(5)

and

\mathrm{HQC}_{\mathrm{GMM}} = -2 \ln(L) + 2 (3 n_{\mathrm{states}} - 1) \ln(\ln(n_{\mathrm{pts}})),

(6)

where $L$ is the likelihood for the estimated model defined as the product of likelihoods for each data point y(t_i), each of which is described by a linear combination of likelihoods for each state’s Gaussian emission distribution $N$ with mean and SD μ_j and σ_j and mixing coefficient w_j (Eqs. 7 and 8) (30).

L = \prod_{i=1}^{n_{\mathrm{pts}}} \sum_{j=1}^{n_{\mathrm{states}}} w_{j} N(y(t_{i}) \mid \mu_{j}, \sigma_{j})

(7)

N(y \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^{2}}{2\sigma^{2}}\right)

(8)
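To make the RSS- and GMM-based criteria concrete, the following Python sketch evaluates Eqs. 2–6 from a model's summary statistics, using the standard normal density (with 2σ² in the exponent) for the mixture likelihood. The function names and argument order are hypothetical; this is an illustration of the formulas, not the DISC source code.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Standard normal density with mean mu and SD sigma."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gmm_log_likelihood(data, weights, means, sds):
    """ln(L) for a Gaussian mixture: sum over points of the log mixture density."""
    return sum(
        math.log(sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, means, sds)))
        for y in data
    )


def bic_rss(rss, n_pts, n_states, n_transitions):
    return n_pts * math.log(rss / n_pts) + (n_transitions + n_states) * math.log(n_pts)


def aic_rss(rss, n_pts, n_states, n_transitions):
    return n_pts * math.log(rss / n_pts) + 2 * (n_transitions + n_states)


def bic_gmm(log_l, n_pts, n_states):
    return -2 * log_l + (3 * n_states - 1) * math.log(n_pts)


def aic_gmm(log_l, n_states):
    return -2 * log_l + 2 * (3 * n_states - 1)


def hqc_gmm(log_l, n_pts, n_states):
    return -2 * log_l + 2 * (3 * n_states - 1) * math.log(math.log(n_pts))


# Same fit (ln L = -50, two states, 100 points), three different penalties
print(aic_gmm(-50.0, 2), hqc_gmm(-50.0, 100, 2), bic_gmm(-50.0, 100, 2))
```

For a fixed likelihood, the three GMM-based criteria differ only in how strongly the 3n_states − 1 free parameters are penalized, so for this example the values are ordered AIC_GMM < HQC_GMM < BIC_GMM.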

We define MDL as previously described in (23):

\mathrm{MDL} = F + G,

(9)

F = \frac{\sum_{i=1}^{n_{\mathrm{pts}}} \left| y(t_{i}) - y_{\mathrm{fit}}(t_{i}) \right|}{2\sigma},

(10)

and

G = \frac{n_{\mathrm{states}}}{2} \ln\left(\frac{1}{2\pi}\right) + n_{\mathrm{states}} \ln\left(\frac{y_{\max} - y_{\min}}{\sigma}\right) + \frac{n_{\mathrm{transitions}}}{2} \ln(n_{\mathrm{pts}}) + \frac{1}{2} \left[ \sum_{i=1}^{n_{\mathrm{states}}} \ln\left(n_{\mathrm{pts\,in\,state}\,i}\right) + \sum_{j=1}^{n_{\mathrm{transitions}}} \ln\left(\frac{(\Delta y_{\mathrm{transition}\,j})^{2}}{\sigma^{2}}\right) \right].

(11)
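Eqs. 9–11 can be evaluated directly once an idealization is summarized by its per-state sample counts and transition step sizes. The Python sketch below is an illustration under assumptions: the function name and its argument layout are hypothetical, and σ is taken as a user-supplied noise SD.

```python
import math


def mdl(y, y_fit, sigma, pts_per_state, jump_sizes):
    """MDL = F + G for a fitted step function.

    y, y_fit:      noisy data and idealized fit
    sigma:         noise SD (supplied by the caller in this sketch)
    pts_per_state: number of samples assigned to each state
    jump_sizes:    intensity change at each transition
    """
    n_pts = len(y)
    n_states = len(pts_per_state)
    n_transitions = len(jump_sizes)
    # Goodness of fit: sum of absolute residuals over 2*sigma
    f = sum(abs(a - b) for a, b in zip(y, y_fit)) / (2 * sigma)
    # Complexity cost of describing states and transitions
    g = (
        n_states / 2 * math.log(1 / (2 * math.pi))
        + n_states * math.log((max(y) - min(y)) / sigma)
        + n_transitions / 2 * math.log(n_pts)
        + 0.5 * (
            sum(math.log(n) for n in pts_per_state)
            + sum(math.log(d ** 2 / sigma ** 2) for d in jump_sizes)
        )
    )
    return f + g


y = [0.0, 0.0, 1.0, 1.0]
print(mdl(y, y, sigma=1.0, pts_per_state=[2, 2], jump_sizes=[1.0]))
```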

To quantify the quality of DISC’s idealizations with different OCs, each event returned by DISC was determined to be either a true positive (TP), false positive (FP), or false negative (FN) based on the known simulated noiseless sequence. These same metrics were also used to compare the idealization of AutoDISC to two other unsupervised idealization approaches: step transition and state identification (STaSI) and AutoStepfinder (23,31). Because events were simulated with subsample timing and intensities integrated over the sample period, samples containing transitions between states exhibit intermediate intensities between the true state intensity levels. As these intermediate intensities are an artifact of the discretization process rather than true state intensities, we set such intermediates in the noiseless sequence to the intensity level associated with the state that was occupied for the largest portion of the sample period before determining TP, FP, and FN events. Furthermore, to prevent slight numerical differences between simulated and idealized intensity levels or subtle offsets in event onset or offset from being construed as errors, TP events for OC comparisons were allowed to have intensities within ±10% of the known SD of the intensity level and onset or offset times within ±3 samples of the known event timing. For all software comparisons, this intensity envelope was expanded to ±25% of the known SD of the intensity level to limit excessive error attribution to STaSI or AutoStepfinder so long as they are reasonably close to the true intensity levels. Events classified as FPs were either extraneous events or correct events with the wrong intensity. FNs were defined as missed events. For each idealization, accuracy, precision, and recall were calculated as general performance metrics ranging from 0 (worst) to 1 (best) (Eqs. 12, 13, and 14). 
The F1 score is a widely used overall metric for summarizing performance and ranges from 0 (worst) to 1 (best) (Eq. 15) (32).

Accuracy = \frac{TP}{TP + FP + FN}

(12)

Precision = \frac{TP}{TP + FP}

(13)

Recall = \frac{TP}{TP + FN}

(14)

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

(15)
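Given event counts for one idealization, Eqs. 12–15 are computed as in the following Python sketch. The function name is hypothetical, and returning 0 when a denominator is zero is an assumption made here for convenience; the text does not state how such cases were handled.

```python
def idealization_metrics(tp, fp, fn):
    """Accuracy, precision, recall, and F1 from TP, FP, and FN event counts."""

    def ratio(num, den):
        # Assumption: define the metric as 0 when its denominator is 0
        return num / den if den else 0.0

    accuracy = ratio(tp, tp + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 8 correct events, 2 extraneous events, 2 missed events
print(idealization_metrics(tp=8, fp=2, fn=2))
```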

OC penalty hyperparameter

To test the effects of a variable penalty term, we scaled the penalty terms of AIC_GMM, BIC_RSS, and MDL using a hyperparameter λ (see Eq. 16). For each unique combination of conditions (SNR_S and transition rate), short (300 samples) or long (3000 samples) simulated traces from the four-site and three-state cyclic models with added state intensity heterogeneity were idealized with DISC using AIC_GMM, BIC_RSS, or MDL for values of λ ranging from 0.001 to 100.

\mathrm{OC} = \text{fit error} + \lambda (\text{overfit penalty})

(16)

Linear decision boundary for AIC_GMM vs. BIC_RSS

To generate a decision boundary for the optimal choice of either AIC_GMM or BIC_RSS based on the properties of the time series, we used MATLAB’s fitcsvm function from the Statistics and Machine Learning Toolbox (The MathWorks). For each data series from each tested model, we computed a preference score for AIC_GMM vs. BIC_RSS based on their individual F1 scores (Eq. 17).

\text{Preference for } \mathrm{AIC}_{\mathrm{GMM}} = \frac{F1_{\mathrm{GMM}}}{F1_{\mathrm{GMM}} + F1_{\mathrm{RSS}}}

(17)

For each model, these scores were binned in two dimensions according to the effective SNR and log₁₀(number of samples) of their corresponding series. Each bin was labeled as either “AIC_GMM optimal” or “BIC_RSS optimal” according to its average preference score (AIC_GMM optimal for preference scores ≥0.5). These bin labels were used to train a support vector machine (SVM) classifier to determine a linear decision boundary in the two-dimensional space of effective SNR and number of samples.
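The labeling step that precedes SVM training can be sketched as follows. This Python illustration covers only the preference score (Eq. 17) and the two-dimensional binning; the bin widths, function names, input tuple layout, and the tie value used when both F1 scores are zero are all assumptions, and the SVM fit itself (performed with MATLAB's fitcsvm in the text) is not reproduced.

```python
import math
from collections import defaultdict


def preference_for_aic_gmm(f1_gmm, f1_rss):
    """Eq. 17; returns 0.5 when both F1 scores are zero (assumption)."""
    total = f1_gmm + f1_rss
    return 0.5 if total == 0 else f1_gmm / total


def label_bins(series, snr_width=1.0, log_n_width=0.5):
    """Bin by effective SNR and log10(number of samples), then label each bin.

    series: iterable of (effective_snr, n_samples, f1_gmm, f1_rss)
    Bin widths are hypothetical choices for this sketch.
    """
    bins = defaultdict(list)
    for snr, n_samples, f1_gmm, f1_rss in series:
        key = (
            math.floor(snr / snr_width),
            math.floor(math.log10(n_samples) / log_n_width),
        )
        bins[key].append(preference_for_aic_gmm(f1_gmm, f1_rss))
    return {
        key: "AIC_GMM optimal" if sum(scores) / len(scores) >= 0.5 else "BIC_RSS optimal"
        for key, scores in bins.items()
    }


example = [(2.2, 100, 0.3, 0.6), (2.8, 150, 0.4, 0.5), (6.5, 5000, 0.9, 0.7)]
print(label_bins(example))
```

The resulting bin labels, paired with each bin's position in the (effective SNR, log₁₀ samples) plane, would form the training set for a linear classifier.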

Data availability statement

Software implementing AutoDISC is available at https://github.com/marcel-goldschen-ohm/AutoDISC.

Results

Different OCs optimize DISC’s performance under different experimental conditions

The DISC algorithm is a largely unsupervised top-down approach for rapid idealization of noisy piecewise continuous time series typical of SM imaging experiments (13). Although subsequently unsupervised, the algorithm initially requires a user-selected OC to guide its idealization. DISC has three main steps. 1) Divisive segmentation: starting with the intensity values of all data points in the time series assigned to a single cluster (the mean intensity of the time series), each cluster is recursively split into two child clusters until the selected OC is optimized. 2) Hierarchical agglomerative clustering: during segmentation, clusters in separate branches that should belong to the same intensity level may be assigned unique intensity states because of random fluctuations in the data. Agglomerative clustering starting with the segmented clusters from the previous step remerges segments with similar intensity distributions based on the selected OC. 3) Viterbi: the previous steps accurately identify intensity levels in the time series but do not provide a good description of the kinetics of transitions among those levels. To estimate event kinetics, the Viterbi algorithm is applied to determine the most likely sequence of transitions among the identified intensity levels. We have previously shown that DISC provides orders of magnitude faster computational speed while maintaining comparable accuracy, precision, and recall to other commonly used approaches (13). However, the impact of OC choice on the accuracy of DISC’s idealization for different kinds of data has yet to be rigorously evaluated.

To explore the impact of OC choice on DISC’s performance, we simulated noisy SM time series with known state sequences and evaluated the ability of DISC to identify the correct noiseless sequences using different OCs. We simulated data for several different kinds of mechanisms to evaluate a variety of typical SM data. Simulated mechanisms include two-state dynamics at one, two, or four independent sites similar to binding observations from colocalization SM fluorescence experiments and three-state linear or cyclic models with distinct state emissions typical of smFRET experiments (Fig. S1). Furthermore, for each model we varied simulation parameters such as sample length, SNR, and state transition rates to determine the impact of OC choice on data under different experimental conditions (see Materials and methods). Here, we define SNR as the ratio of the average intensity level separation (ΔI_avg) to the SD of the noise fluctuations (σ) (see Materials and methods). In many cases, real experimental SM fluorescence data additionally contain heterogeneity in state emissions (13,33). The source of this heterogeneity is uncertain but is likely caused by shifts of the molecule in the exponentially decaying excitation field or dye photodynamics (14,15). Changes in observed dye brightness due to dye conformation, polarization orientation, partial quenching via electron transfer, and protein-induced fluorescence enhancement are commonly observed (34). To better reflect these real observations, we additionally included heterogeneous state intensities in some of our simulations (see Materials and methods).

Five OCs were initially tested: Bayesian information criterion based on either the residual sum of squared errors (BIC_RSS) or a Gaussian mixture model (BIC_GMM), Akaike information criterion (AIC_GMM) and Hannan-Quinn information criterion (HQC_GMM) based on a Gaussian mixture model, and MDL (Eqs. 2, 3, 5, 6, 7, 8, 9, 10, and 11) (23,26, 27, 28, 29). In each case, DISC’s idealization performance on the noisy simulated data was evaluated using standard criteria of accuracy, precision, and recall (Eqs. 12, 13, and 14). An overall measure of this performance is summarized in the F1 score (0–1: worst to perfect), which combines both precision and recall in a single metric (Eq. 15) (32). For each OC, DISC’s performance was primarily dependent on the SNR and number of samples in the time series (Figs. S2–S14). State transition rates had relatively less of an effect on performance except when rates were extreme (e.g., average rate approaching the sample rate).

For data with heterogeneous state intensities, there was no single OC that was optimal for all simulated conditions (Figs. 1 and S2–S6). Thus, the conditions of each time series dictate the optimal choice of OC. Across tested models, BIC_RSS is almost always the best choice for data with few sample points and low SNR. In some cases, MDL performs as well as BIC_RSS in this regime. However, MDL’s relative performance decreases substantially for faster transition rates, especially as the underlying model complexity increases or the SNR decreases. In contrast, BIC_RSS exhibits either a smaller or negligible drop in relative performance at these faster rates. At the lowest SNRs, there is a trend for performance to decrease with increasing trace length and transition rates. In extreme cases, this occurs because of the idealization converging on a constant value that is the mean of the data. For slower rates and shorter traces, this is more likely to be true than for longer traces with faster transition rates, at which increasing trace length increases the odds of missing an actual transition because of poor SNR. Generally, the GMM-based OCs including BIC_GMM, AIC_GMM, and HQC_GMM tend to outperform BIC_RSS as the number of sample points and/or the SNR increases. For shorter time series and lower SNRs, AIC_GMM outperforms both BIC_GMM and HQC_GMM. However, BIC_RSS is better yet for these series. At higher SNRs at which BIC_RSS performs relatively poorly, all the GMM-based OCs perform similarly well. Thus, a simple choice between BIC_RSS and one of the GMM-based OCs would provide an optimal solution in nearly all tested cases. Given that AIC_GMM has the overall best performance across tested data conditions, the remainder of our analysis focuses on AIC_GMM and BIC_RSS.

Figure 1. Optimal choice of AIC_GMM vs. BIC_RSS depends on experimental conditions. (*Top* and *middle*) Examples of simulated SM time series for the four-site model in Fig. S1 with (*gray*) and without (*black*) added noise and per-event state intensity heterogeneity. Traces are overlaid with DISC’s idealization of the noisy data using either AIC_GMM (*blue*) or BIC_RSS (*orange*). Histograms of the noisy data are shown to the right overlaid with mixtures of Gaussians fitted to the data in each uniquely identified level for both the true noiseless series and the result of each idealization. AIC_GMM tends to underfit shorter series with lower per-site signal-to-noise ratios (SNR_S) (*top*, SNR_S = 3), whereas BIC_RSS tends to overfit longer series with higher SNR_S (*middle*, SNR_S = 6; notice that true levels are split into multiple nearby sublevels). Note that noise increases with observed intensity level such that the effective signal/noise for events with multiple occupied sites will be less than SNR_S (see Materials and methods). (*Bottom*) Summary of performance for DISC’s idealization of simulated noisy SM time series across a range of series lengths and SNR_S for k = 0.005 f_s. Rate × 10 implies k = 0.05 f_s. Mean (*line*) and SD (*shaded region*) for F1 scores (0–1: worst to perfect; Eq. 15) for 10–1000 simulated time series at each unique set of conditions (number of samples and SNR_S) idealized with DISC using AIC_GMM, BIC_RSS, BIC_GMM, HQC_GMM, or MDL (see Materials and methods). See Figs. S2–S18 for additional conditions and models.

In general, BIC_RSS outperforms AIC_GMM for short traces with fewer than ∼1000 samples and low SNRs of ∼3 or less (Fig. 1), whereas AIC_GMM outperforms BIC_RSS for longer traces with more samples and/or higher SNR (Fig. 1). This is largely because AIC_GMM tends to underfit shorter series with low SNR, whereas BIC_RSS overfits longer series with high SNR. Each OC balances a goodness of fit term and a penalty term that attempts to prevent overfitting for overly complex models or sequences (Eq. 1). The observed behavior can be understood by examination of the penalty terms for AIC_GMM and BIC_RSS (Eqs. 2 and 5). Because the number of level transitions or change points is a stochastic function of the length of the time series, BIC_RSS will generally have a lower penalty than AIC_GMM for shorter traces with few transitions. For short traces with a low SNR, AIC_GMM is likely to underfit the number of distinct intensity levels because of the small amount of data and relatively large variation around each level, whereas the relatively smaller penalty for adding a level with BIC_RSS means that such underfitting is less likely (Fig. 1, top). However, the lower penalty for additional levels also means that BIC_RSS tends to split levels with heterogeneous event intensities into multiple sublevels because of a marginal increase in transitions per sublevel compared to a significant reduction in the sum of squared residuals. The amount of heterogeneity in each level will increase with the number of transitions into each level, which increases with the length of the series. Also, as SNR increases, such heterogeneity becomes more distinct. Thus, for longer traces with a high SNR, heterogeneous event intensities are likely to be overfitted by BIC_RSS in comparison to AIC_GMM (Fig. 1, middle). In the absence of heterogeneous state intensities, BIC_RSS either matches or outperforms AIC_GMM for most experimental conditions (Figs. S7 and S8). However, AIC_GMM continues to outperform BIC_RSS in a few cases for longer traces with high SNR and rapid transition rates (Fig. S7). This, again, can be understood by examination of the penalty terms for each OC (Eqs. 2 and 5). At faster rates, long traces will have large numbers of change points that result in a larger penalty term for BIC_RSS than for AIC_GMM. Because of this larger penalty, BIC_RSS is more prone to underfit these traces than AIC_GMM.
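The scaling of the two penalty terms (Eqs. 2 and 5) can be checked numerically with the short Python sketch below. The function names and the example transition count are illustrative assumptions, and because the two criteria also use different fit terms, the penalties alone do not determine which criterion performs better on a given trace.

```python
import math


def penalty_bic_rss(n_pts, n_states, n_transitions):
    """Penalty term of BIC_RSS: grows with both transitions and series length."""
    return (n_transitions + n_states) * math.log(n_pts)


def penalty_aic_gmm(n_states):
    """Penalty term of AIC_GMM: depends only on the number of states."""
    return 2 * (3 * n_states - 1)


# Long, fast trace: 10,000 samples, three states, and a hypothetical
# 500 change points
print(penalty_bic_rss(10_000, 3, 500), penalty_aic_gmm(3))
```

Because the AIC_GMM penalty is independent of the number of transitions, it stays at 16 for any three-state idealization, whereas the BIC_RSS penalty grows with every additional change point in a long, rapidly switching trace.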

These trends in OC performance were consistent for both Gaussian and photon-based Poisson noise, suggesting that DISC can be applied to data with both types of noise despite the GMM-based OCs assuming Gaussian distributed noise (see Materials and methods) (Figs. S9–S16). As compared to simulations with Gaussian noise, the upper limit of performance for simulations with Poisson noise was a bit lower because of larger noise envelopes in high-intensity states. This effect was exacerbated in simulations with low photon counts.

The relative success of AIC_GMM compared to BIC_GMM suggested that an evaluation of AIC_RSS (Eq. 4) was necessary to compare to BIC_RSS. For two of the tested models, AIC_RSS performs consistently worse than BIC_RSS across the evaluated conditions (Figs. S17 and S18). AIC_RSS tends to overfit in comparison to BIC_RSS because of its relatively smaller penalty term for added states and/or transitions. For longer series with higher SNRs in which BIC_RSS already tends to overfit, AIC_RSS only exacerbates the overfitting. Therefore, AIC_RSS was not included in further analyses. Analogously, AIC_GMM performs better than BIC_GMM because it is less prone to underfitting because of its smaller penalty term. We note, however, that there was a slight preference for BIC_GMM over AIC_GMM at the highest SNRs and rates for some models. In this regime, overfitting noise fluctuations and rapid flickery transitions is mitigated by the higher penalty term for BIC_GMM. Thus, for data at higher SNRs and rates than tested here, it is possible that BIC_GMM would be a more optimal choice. Given that most fluorescence-based SM series fall in the range of tested conditions, we focus on AIC_GMM.

Optimal idealization with DISC across data conditions requires selection between at least two OCs such as AIC_GMM or BIC_RSS based on the number of samples and SNR in the time series. For SM fluorescence measurements in which fluorophore bleaching and spatial or temporal heterogeneity in excitation power can lead to stochastic variability in both the number of samples and SNR across molecules, this choice should optimally be made on a per-molecule basis.

Optimization of a variable penalty hyperparameter does not outperform a simple choice between either AIC_GMM or BIC_RSS

Given the dependence of DISC’s performance on the penalty term of the chosen OC, we explored whether optimizing a variable penalty term for a given OC would further enhance performance. We introduced a hyperparameter λ that scales the penalty term (Eq. 16). Using AIC_GMM, BIC_RSS, or MDL as the framework for the OC, we scaled λ from 0.001 to 100 and chose the value of λ that maximized the OC’s performance for a given set of experimental conditions (see Materials and methods) (Figs. S19–S22). Notably, all the GMM-based OCs differ solely in their scaling of the penalty term, meaning λ evaluation of AIC_GMM is inclusive of similar evaluations with BIC_GMM or HQC_GMM. This markedly improved performance for each OC in their respective problematic regimes. However, in nearly all conditions, AIC_GMM, BIC_RSS, or MDL optimized with this approach did not outperform the better of either AIC_GMM or BIC_RSS with λ = 1. Although it is theoretically possible to fit a curve for a given OC’s optimal λ at every experimental condition to guarantee peak performance, the optimal λ is relatively model dependent (Figs. S19–S22). This curve-fitting approach would therefore overfit the underlying training models, limiting its utility for many SM experiments. In contrast, a simple selection between AIC_GMM or BIC_RSS is practical and generally sufficient for optimal performance on a per-molecule basis.
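The effect of scaling the penalty can be sketched as follows. This is a minimal illustration that assumes the criterion has the form (fit term) + λ × (penalty term) and that lower values are preferred; Eq. 16 is not reproduced here, and the candidate values, function names, and grid density are hypothetical. The sketch shows only how λ shifts the selected model, not the outer search for the λ that maximizes performance.

```python
from typing import List, Tuple

# Each candidate idealization is summarized as (fit_term, penalty_term).
Candidate = Tuple[float, float]


def scaled_criterion(fit_term: float, penalty_term: float, lam: float) -> float:
    """Criterion with the penalty term scaled by the hyperparameter lam."""
    return fit_term + lam * penalty_term


def best_candidate(candidates: List[Candidate], lam: float) -> int:
    """Index of the candidate with the lowest scaled criterion."""
    scores = [scaled_criterion(fit, pen, lam) for fit, pen in candidates]
    return scores.index(min(scores))


def log_grid(lo_exp: int = -3, hi_exp: int = 2, per_decade: int = 4) -> List[float]:
    """Logarithmically spaced values from 10**lo_exp to 10**hi_exp inclusive."""
    steps = (hi_exp - lo_exp) * per_decade
    return [10 ** (lo_exp + i / per_decade) for i in range(steps + 1)]


# Hypothetical candidates, ordered from simplest (poor fit, small penalty)
# to most complex (good fit, large penalty).
candidates = [(120.0, 4.0), (80.0, 10.0), (70.0, 25.0)]
for lam in log_grid():
    print(f"lambda={lam:.3g}: selected candidate {best_candidate(candidates, lam)}")
```

Small values of λ favor the most complex candidate and large values favor the simplest one, which is why a single fixed λ cannot be optimal across all conditions.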

A decision boundary for selecting the optimal OC on a per-molecule basis

To automate the unsupervised optimal choice of either AIC_GMM or BIC_RSS on a per-molecule basis, we used a machine learning tool, the SVM, to identify a two-dimensional linear decision boundary based on number of samples and the effective SNR in each time series (see Materials and methods) (Fig. 2). For each unique pair of conditions (number of samples and effective SNR), Fig. 2 illustrates the relative preference for AIC_GMM over BIC_RSS based on F1 score (Eq. 17). The boundary between regimes of high BIC_RSS-preference and high AIC_GMM-preference is well described by a line, and more complex nonlinear boundaries were unnecessary. Constraining the boundary to a linear SVM also prevented overfitting small fluctuations in the training data set.

A linear decision boundary for the optimal choice of either AIC_GMM or BIC_RSS. Heatmap of the degree of preference for AIC_GMM vs. BIC_RSS from 0 to 1 (see Eq. 17) is shown for simulations of the four-site binding model shown in Fig. S1 with state intensity heterogeneity. Preference was determined from 1000 simulated time series for each unique pair of conditions (SNR_S and number of samples in the time series). Given nonuniform noise across intensity levels, preference is shown as a function of the effective SNR of each trace (see Materials and methods) rather than the average simulated SNR_S. The white line denotes the linear decision boundary determined by the SVM classifier. Boundaries for additional models with and without state intensity heterogeneity are shown in Figs. S23 and S24. To see the figure in color, go online.

A similar linear decision boundary was determined for each model in Fig. S1 for simulations both with and without per-event state intensity heterogeneity (Figs. S23 and S24). The boundaries for each model were largely similar, with the primary difference being a slight shift to lower SNR for the simpler two-state models at one or two sites. Nonetheless, the same boundary determined for the more complex four-site and three-state models was also appropriate for the simpler one- and two-site models given the broad region of roughly equivalent preference for AIC_GMM or BIC_RSS in the simpler models. Here, we selected the boundary for the four-site binding model as appropriate for all tested models. This boundary is given by log₁₀(#samples) = −0.49SNR_effective + 4.69. For a given model, the determined boundary was somewhat dependent on transition rate, with small shifts in intercept or slope across rates (Fig. S25). However, the overall cluster of boundaries across rates was highly consistent with each model’s overall boundary. Although boundaries were highly similar for data both with and without per-event state intensity heterogeneity, BIC_RSS was generally either optimal or equivalent to AIC_GMM across tested conditions in the absence of heterogeneity (Figs. S23 and S24). Thus, the need for an automated per-molecule selection of OC is much clearer when the data include such heterogeneity. Nonetheless, use of the identified boundary does not harm idealization when no per-event heterogeneity is observed. We cannot rule out that some mechanisms may give rise to time series with very different decision boundaries. However, the set of models and range of simulated conditions covers representative data for typical SM fluorescence experiments. Thus, the identified boundary is likely to be relevant for many SM imaging data sets or similar data.
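The boundary above can be applied to a single time series as in the sketch below. The slope and intercept are taken from the text; the orientation (AIC_GMM for points above the line, that is, more samples and higher SNR) is inferred from the trends described in this article, and the tie-breaking rule and function names are assumptions made for illustration.

```python
import math

SLOPE = -0.49      # from the reported boundary
INTERCEPT = 4.69   # from the reported boundary


def boundary_log10_samples(snr_effective: float) -> float:
    """log10(number of samples) on the decision boundary at a given effective SNR."""
    return SLOPE * snr_effective + INTERCEPT


def select_oc(n_samples: int, snr_effective: float) -> str:
    """Pick the objective criterion for one time series.

    Assumption: series above the line use AIC_GMM; series on or below it
    use BIC_RSS.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if math.log10(n_samples) > boundary_log10_samples(snr_effective):
        return "AIC_GMM"
    return "BIC_RSS"


print(select_oc(10000, 6.0))  # long, high-SNR series
print(select_oc(100, 2.0))    # short, low-SNR series
```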

Estimation of effective SNR for SM time series

To use this decision boundary, one must estimate the average separation in intensity levels (ΔI_avg) and the overall noise (σ) of each time series in an experimental data set. Therefore, we developed an unsupervised approach to estimate the effective SNR of an SM time series (Fig. 3). First, we apply DISC using BIC_RSS to generate an initial idealization that may overfit, but likely does not underfit, the data (Fig. 3, top left). The SD of the residuals between this initial fit and the noisy data is taken as our overall noise estimate σ (Fig. 3, top right). To estimate ΔI_avg, we first find the absolute value of the change in intensity levels at each change point i in the idealized series (ΔI_i). We then generate an array containing n_i copies of each step ΔI_i, where n_i is the number of samples between changepoint i − 1 and i + 1. This procedure ensures that changes in intensity are weighted according to the relative fraction of the series that they define. To avoid inclusion of small intensity changes due to BIC_RSS overfitting noise fluctuations, we only consider change points at which ΔI_i > 2σ. The mean of the resulting array of ΔI_i-values is taken as our signal estimate ΔI_avg (Fig. 3, bottom left). Finally, the effective SNR is estimated as the ratio ΔI_avg/σ.
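The estimation procedure described above can be sketched as follows, given a noisy series and an initial idealization of the same length. This is an illustrative reading of the text rather than the released implementation: the use of the population SD, the treatment of the series start and end as the neighbors of the first and last change points, and the fallback values when no step exceeds 2σ are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def estimate_snr(data: Sequence[float], ideal: Sequence[float]) -> Tuple[float, float, float]:
    """Return (delta_avg, sigma, snr) for one series and its idealization."""
    if len(data) != len(ideal) or len(data) < 2:
        raise ValueError("data and ideal must have equal length >= 2")

    # Noise estimate: SD of the residuals between data and idealization.
    residuals = [d - m for d, m in zip(data, ideal)]
    sigma = statistics.pstdev(residuals)

    # Change points: indices where the piecewise-constant idealization steps.
    change_points = [t for t in range(1, len(ideal)) if ideal[t] != ideal[t - 1]]
    bounds = [0] + change_points + [len(ideal)]

    weighted_sum = 0.0
    total_weight = 0
    for i, t in enumerate(change_points, start=1):
        step = abs(ideal[t] - ideal[t - 1])
        if step > 2 * sigma:  # ignore small steps likely due to overfitting
            # n_i: samples between change points i - 1 and i + 1.
            n_i = bounds[i + 1] - bounds[i - 1]
            weighted_sum += n_i * step
            total_weight += n_i

    # Weighted mean equals the mean of an array holding n_i copies of each step.
    delta_avg = weighted_sum / total_weight if total_weight else 0.0
    if sigma > 0:
        snr = delta_avg / sigma
    else:
        snr = float("inf") if delta_avg > 0 else 0.0
    return delta_avg, sigma, snr
```

Weighting by n_i instead of building the array of copies gives the same mean while avoiding the extra memory.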

Estimation of the average intensity level separation and noise in a time series. (*Top left*) Idealization of a noisy time series using DISC with BIC_RSS. (*Top right*) Histogram of the residuals after subtracting the idealization from the noisy time series to the left. The SD σ of the distribution of residuals is taken as the estimate for the average noise in the observations. (*Bottom left*) Histogram of step heights from the idealized series above in which each step is duplicated according to the number of samples in the preceding and following dwells (see Results). The mean of the histogram (*solid line*) after discarding step heights less than 2σ (*dashed line*) is taken as the estimate for the average intensity level separation in the time series (ΔI_avg). Estimated SNR = ΔI_avg/σ. (*Bottom right*) Estimated versus effective SNR (see Materials and methods) summarized across simulated time series for all tested models in Fig. S1 under all tested conditions (SNR_S, number of sample points per trace, and average transition rates). Mean (*solid line*) and SD (*shaded region*) across time series. The dashed diagonal line indicates perfect estimation.

We tested the SNR estimation approach described above by comparing it to the known effective SNR for each simulated series for each of the models in Fig. S1 with and without state intensity heterogeneity. In all cases, this estimation procedure provided a good approximation of the known effective SNR (Fig. 3, bottom right). SNR tended to be slightly overestimated for traces with low effective SNRs (around 2) and underestimated for traces with high effective SNRs (around 6–7). However, the over- and underestimation were marginal and are unlikely to have a major impact on choice of OC for the identified decision boundary.

AutoDISC: a completely unsupervised workflow for optimal per-molecule performance

With the linear decision boundary for optimal choice of either AIC_GMM or BIC_RSS and the SNR estimation approach described above, we can now establish a completely unsupervised workflow, AutoDISC, for optimizing the DISC algorithm on a per-molecule basis (Fig. 4). The workflow for an individual time series is as follows: 1) Apply DISC with BIC_RSS to idealize the time series. 2) Estimate the effective SNR of the series. 3) Using the number of samples and the estimated effective SNR, select the optimal OC according to the linear decision boundary between AIC_GMM and BIC_RSS. 4) If AIC_GMM is optimal, apply DISC with AIC_GMM to idealize the time series; otherwise, keep the initial idealization. This workflow allows unsupervised per-molecule idealization of noisy SM data with stochastic variation in series duration, SNR, and event intensities typical of many SM fluorescence imaging data sets.
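The control flow of this workflow can be sketched as below. The DISC idealizer and the SNR estimator are passed in as callables because their implementations are outside the scope of this sketch; `run_disc`, `estimate_snr`, and their signatures are hypothetical placeholders, not the API of the released software, and the orientation of the boundary is inferred from the trends reported in this article.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures: run_disc(data, oc_name) -> idealized series,
# estimate_snr(data, idealization) -> effective SNR estimate.
Idealizer = Callable[[Sequence[float], str], List[float]]
SnrEstimator = Callable[[Sequence[float], Sequence[float]], float]


def auto_disc(
    data: Sequence[float],
    run_disc: Idealizer,
    estimate_snr: SnrEstimator,
    slope: float = -0.49,
    intercept: float = 4.69,
) -> Tuple[List[float], str]:
    """Idealize one series and report which objective criterion was used."""
    if len(data) == 0:
        raise ValueError("data must not be empty")

    # 1) Initial idealization with BIC_RSS.
    initial = run_disc(data, "BIC_RSS")

    # 2) Effective SNR estimated from the initial idealization.
    snr = estimate_snr(data, initial)

    # 3) Compare the series with the linear decision boundary.
    use_aic_gmm = math.log10(len(data)) > slope * snr + intercept

    # 4) Rerun with AIC_GMM only when it is selected; otherwise reuse step 1.
    if use_aic_gmm:
        return run_disc(data, "AIC_GMM"), "AIC_GMM"
    return initial, "BIC_RSS"
```

Because the BIC_RSS idealization from step 1 is reused whenever BIC_RSS is selected, DISC runs at most twice per series.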

Workflow for unsupervised per-molecule optimal idealization by DISC.

Comparison to other unsupervised idealization methods

AutoDISC was benchmarked by comparison to the unsupervised idealization methods STaSI and AutoStepfinder (Fig. 5) (23,31). Idealization of simulated series for two-state dynamics at one or four sites and three-state cyclic dynamics was evaluated using either AutoDISC, STaSI, or AutoStepfinder across a range of conditions (SNR_S, number of samples per trace, and transition rates) according to F1 score. AutoDISC either outperformed or was equivalent to all other methods across models and nearly all conditions tested both with and without per-event state intensity heterogeneity (Figs. 5 and S26–S31).
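For reference, the F1 score used throughout these comparisons is the harmonic mean of precision and recall (32). The sketch below computes it from counts of true positives, false positives, and false negatives; how idealized events are matched to true events to obtain those counts is defined in Materials and methods (Eq. 15) and is not reproduced here, so the function name and the zero-count convention are assumptions.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, from 0 (worst) to 1 (perfect)."""
    if min(true_positives, false_positives, false_negatives) < 0:
        raise ValueError("counts must be non-negative")
    if true_positives == 0:
        # Assumed convention: no correct detections gives a score of 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

Overfitting mainly lowers precision through extra false positives, whereas underfitting mainly lowers recall through missed events; F1 penalizes both.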

Comparison of AutoDISC, STaSI, and AutoStepfinder. (*Top* and *middle*) Examples of simulated SM time series for the three-state cyclic model (*top*; SNR_S = 7, k_f = 0.01 f_s) and four-site model (*middle*; SNR_S = 6, k = 0.05 f_s) shown in Fig. S1 with (*gray*) and without (*black*) added noise and per-event state intensity heterogeneity. Simulated traces are overlaid with idealizations from AutoDISC (*blue*), STaSI (*yellow*), and AutoStepfinder (*orange*) (23,31). Histograms of the noisy data are shown to the right of each series overlaid with mixtures of Gaussians fitted to the data in each uniquely identified level for both the true noiseless series and the idealizations. (*Bottom*) Summary of each method’s idealization performance for varying series length, SNR_S, and rate constant k using the four-site model. Mean (*line*) and SD (*shaded region*) for F1 scores (0–1: worst to perfect; Eq. 15) for 10–1000 simulated time series at each unique set of conditions (number of samples and SNR_S) are given. See Figs. S26–S31 for additional conditions and models. To see the figure in color, go online.

The best competitor, STaSI, had F1 scores similar to, or in a few cases slightly better than, those of AutoDISC for shorter series with low SNR_S and slow transition rates but performed much worse for longer traces with higher SNR_S and faster transition rates, consistent with previous observations based on BIC_GMM (13). This performance can be understood based on two observations. First, STaSI tends to miss many short-lived events, which is exacerbated under conditions of faster dynamics. In cases in which all sojourns in an intensity level are brief, STaSI may ignore the level completely (Fig. 5, middle). Second, in the presence of per-event state intensity heterogeneity, STaSI assigns numerous unique levels for each of these events rather than a single global level as correctly identified by AutoDISC (Fig. 5, top). This results in overly complex idealized series with extra states and spurious intrastate transitions that challenge automated screening or analysis.

AutoStepfinder performed similarly to AutoDISC and STaSI for shorter series with slow dynamics. However, F1 scores for AutoStepfinder were consistently lower than those for either STaSI or AutoDISC for nearly all other conditions tested. The relatively poorer performance of AutoStepfinder was primarily the result of overfitting noise fluctuations, thereby resulting in the misidentification of short-lived dwells in nonexistent states (Fig. 5, top and middle). Such overfitting occurred in both the presence and absence of per-event state intensity heterogeneity, suggesting that its penalty for additional states is often too lenient. In the presence of per-event state intensity heterogeneity, AutoStepfinder behaves similarly to STaSI in that it assigns numerous unique levels for each event rather than identifying the true single global level.

AutoDISC performs well at faster transition rates regardless of heterogeneity for more complex models with F1 scores substantially larger than the next-best tested method as series length increases. By optimally selecting the appropriate OC, equivalent or better performance is also achieved under most other conditions. Furthermore, AutoDISC is the only tested method that is consistently robust to per-event state intensity heterogeneity. Given the presence of such features in many SM fluorescence experiments, AutoDISC is an ideal choice for automated analysis or screening of these data sets.

Discussion

The performance of the DISC algorithm with each of six different OCs (BIC_GMM, BIC_RSS, AIC_GMM, AIC_RSS, HQC_GMM, and MDL) was investigated for simulations reflecting typical SM fluorescence time series observations under variable experimental conditions. Each OC differentially impacts DISC’s performance for different conditions, with the SNR and number of samples in a time series being the primary determinants for the optimal OC. For nearly all conditions tested, the OC with the best F1 score was either BIC_RSS or one of the GMM-based OCs such as AIC_GMM. BIC_RSS is typically optimal for time series with few samples and/or a low SNR, whereas AIC_GMM is typically optimal for time series with a thousand or more samples and high SNR. This difference in behavior can largely be attributed to the different penalty terms for each OC and the GMM likelihood distributions. However, optimization of a variable penalty term does not generally outperform either AIC_GMM or BIC_RSS, suggesting that a simple choice between the two OCs is sufficient for optimal performance by DISC in nearly all cases. Examination of the optimal OC under variable conditions (i.e., SNR and number of samples) reveals a general linear decision boundary that can be used to select the optimal OC for a given set of conditions. We further developed an estimator for the effective SNR of experimental SM time series that relates intensity level separation and Gaussian noise. Together, these developments establish AutoDISC: a completely unsupervised workflow for the DISC algorithm that automates selection of the optimal OC on a per-molecule basis. AutoDISC performs favorably compared to other contemporary unsupervised idealization approaches on time series typical of SM fluorescence experiments.

Fluorescence imaging is a common approach for massively parallel observations of SM time series (1,3,6,15). Critically, effective SNRs in SM fluorescence data sets are often near our derived decision boundary, at which optimal choice of OC is most impactful. Furthermore, stochastic fluorophore bleaching and nonuniformities in the optical pathways result in per-molecule variation in properties such as the observation time window and the ratio of the fluorescence intensity separation between distinct states to the noise within each state (24,25). In addition to uncertainty regarding the underlying mechanism, per-molecule variation in state intensity levels challenges analysis with approaches such as HMMs having user-defined state intensity distributions. Furthermore, for such large data sets, it is often the case that only a subset of molecules exhibits the behavior of interest. It is therefore ideal to avoid computationally intensive analyses such as HMMs on the potentially large fraction of nonrelevant molecules. Rapid, unsupervised approaches are thus ideal in many cases, either as a complete analysis or for prescreening to identify a subset of molecules for further analysis.

DISC is a recently developed, largely unsupervised approach for idealization of noisy SM time series that automates detection of the intensity levels within each individual time series without requiring postulation of a specific mechanistic model (13). When possible, the additional constraint imposed by global optimization of a specific molecular mechanism or a singular set of state intensity distributions should be preferred. However, this approach is not appropriate for unknown mechanisms or experimental data with per-molecule variability in state intensity emissions. DISC not only handles these data efficiently but is also robust to within-state intensity fluctuations that can arise from changing orientation in polarized and/or exponentially decaying excitation fields or dye photodynamics (14,15,34). Furthermore, DISC is orders of magnitude faster than HMM methods while maintaining state-of-the-art accuracy, precision, and recall (13). However, DISC relies on a user-specified OC to guide its idealization. Here, we show that the optimal choice of OC depends primarily on both the number of sample points and the signal and noise properties in each time series. Thus, maximizing the performance of DISC on experimental data sets with variability in these parameters requires selection of the optimal OC on a per-molecule basis.

By automating the per-molecule choice of OC, we develop a fully unsupervised workflow, AutoDISC, that maximizes DISC’s performance across data sets with stochastic variation in observation conditions. AutoDISC either outperforms or matches two other unsupervised idealization methods (STaSI and AutoStepfinder) across nearly all conditions tested. One of the advantages of AutoDISC over these other methods is its ability to robustly identify global intensity levels in the presence of per-event intensity heterogeneity. For data that include such heterogeneity, this is highly beneficial for automated identification and analysis of the dynamics of interest. However, if one’s goal is to identify the observed local intensity of each event rather than the global state level, then an alternate method such as STaSI or AutoStepfinder may be preferred. We also note that we only evaluated data with up to five unique states similar to smFRET experiments, whereas some observations such as stepwise motion of molecular motors or extension in molecular tweezers can give rise to many more levels (35,36). In principle, DISC should continue to perform well with many additional levels, provided they are reasonably well separated. Another advantage of AutoDISC is its relatively better performance for series with faster dynamics, data that have historically been reliant on computationally expensive approaches such as HMMs (37,38).

As compared to DISC, AutoDISC’s computational cost is either marginal or at most doubled owing to potentially rerunning DISC with AIC_GMM. Thus, this workflow does not negate the benefit of DISC’s computational speed (13), making it an attractive approach for large high-throughput data sets. Given the prevalence of per-molecule stochastic variation in SM fluorescence observations and effective SNRs that are often near our derived decision boundary, AutoDISC provides an immediately useful tool to both optimize and speed exploration and analysis of these and similar SM data sets.

Author contributions

A.B. performed all simulations and analysis and contributed to writing the manuscript. M.P.G.-O. conceived, designed, and oversaw the project and contributed to writing of the manuscript.

Editor: Vasanthi Jayaraman.

Footnotes

Supporting material can be found online at https://doi.org/10.1016/j.bpj.2021.08.045.

Supporting material

Document S1. Figs. S1–S31

mmc1.pdf (40.5 MB, pdf)

Data S1.

mmc2.zip (17.3 KB, zip)

Document S2. Article plus supporting material

mmc3.pdf (42.5 MB, pdf)

References

1. Chen J., Dalal R.V., Puglisi J.D. High-throughput platform for real-time monitoring of biological processes by multicolor single-molecule fluorescence. Proc. Natl. Acad. Sci. USA. 2014;111:664–669. doi: 10.1073/pnas.1315735111.
2. Soltani M., Lin J., Wang M.D. Nanophotonic trapping for precise manipulation of biomolecular arrays. Nat. Nanotechnol. 2014;9:448–452. doi: 10.1038/nnano.2014.79.
3. Yan R., Moon S., Xu K. Spectrally resolved and functional super-resolution microscopy via ultrahigh-throughput single-molecule spectroscopy. Acc. Chem. Res. 2018;51:697–705. doi: 10.1021/acs.accounts.7b00545.
4. Altman R.B., Terry D.S., Blanchard S.C. Cyanine fluorophore derivatives with enhanced photostability. Nat. Methods. 2011;9:68–71. doi: 10.1038/nmeth.1774.
5. Halabi E.A., Pinotsi D., Rivera-Fuentes P. Photoregulated fluxional fluorophores for live-cell super-resolution microscopy with no apparent photobleaching. Nat. Commun. 2019;10:1232. doi: 10.1038/s41467-019-09217-7.
6. Juette M.F., Terry D.S., Blanchard S.C. Single-molecule imaging of non-equilibrium molecular ensembles on the millisecond timescale. Nat. Methods. 2016;13:341–344. doi: 10.1038/nmeth.3769.
7. Miller H., Zhou Z., Leake M.C. Single-molecule techniques in biophysics: a review of the progress in methods and applications. Rep. Prog. Phys. 2018;81:024601. doi: 10.1088/1361-6633/aa8a02.
8. Zhou X., Wong S.T.C. Informatics challenges of high-throughput microscopy. IEEE Signal Process. Mag. 2006;23:63–72.
9. Blanco M., Walter N.G. Analysis of complex single-molecule FRET time trajectories. Methods Enzymol. 2010;472:153–178. doi: 10.1016/S0076-6879(10)72011-5.
10. Bronson J.E., Fei J., Wiggins C.H. Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys. J. 2009;97:3196–3205. doi: 10.1016/j.bpj.2009.09.031.
11. McKinney S.A., Joo C., Ha T. Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J. 2006;91:1941–1951. doi: 10.1529/biophysj.106.082487.
12. Blanco M.R., Johnson-Buck A.E., Walter N.G. In: Encyclopedia of Biophysics. Roberts G.C.K., editor. Springer; 2013. Hidden Markov modeling in single-molecule biophysics; pp. 971–975.
13. White D.S., Goldschen-Ohm M.P., Chanda B. Top-down machine learning approach for high-throughput single-molecule analysis. eLife. 2020;9:e53357. doi: 10.7554/eLife.53357.
14. Dempsey G.T., Bates M., Zhuang X. Photoswitching mechanism of cyanine dyes. J. Am. Chem. Soc. 2009;131:18192–18193. doi: 10.1021/ja904588g.
15. Levene M.J., Korlach J., Webb W.W. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 2003;299:682–686. doi: 10.1126/science.1079700.
16. Greenfeld M., Pavlichin D.S., Herschlag D. Single Molecule Analysis Research Tool (SMART): an integrated approach for analyzing single molecule data. PLoS One. 2012;7:e30024. doi: 10.1371/journal.pone.0030024.
17. Hines K.E., Bankston J.R., Aldrich R.W. Analyzing single-molecule time series via nonparametric Bayesian inference. Biophys. J. 2015;108:540–556. doi: 10.1016/j.bpj.2014.12.016.
18. Sgouralis I., Pressé S. An introduction to infinite HMMs for single-molecule data analysis. Biophys. J. 2017;112:2021–2029. doi: 10.1016/j.bpj.2017.04.027.
19. Celik N., O’Brien F., Barrett-Jolley R. Deep-channel uses deep neural networks to detect single-molecule events from patch-clamp data. Commun. Biol. 2020;3:3. doi: 10.1038/s42003-019-0729-3.
20. Li J., Zhang L., Walter N.G. Automatic classification and segmentation of single-molecule fluorescence time traces with deep learning. Nat. Commun. 2020;11:5833. doi: 10.1038/s41467-020-19673-1.
21. Xu J., Qin G., Fang X. Automated stoichiometry analysis of single-molecule fluorescence imaging traces via deep learning. J. Am. Chem. Soc. 2019;141:6976–6985. doi: 10.1021/jacs.9b00688.
22. Kadane J.B., Lazar N.A. Methods and criteria for model selection. J. Am. Stat. Assoc. 2004;99:279–290.
23. Shuang B., Cooper D., Landes C.F. Fast step transition and state identification (STaSI) for discrete single-molecule data analysis. J. Phys. Chem. Lett. 2014;5:3157–3161. doi: 10.1021/jz501435p.
24. Holden S.J., Uphoff S., Kapanidis A.N. Defining the limits of single-molecule FRET resolution in TIRF microscopy. Biophys. J. 2010;99:3102–3111. doi: 10.1016/j.bpj.2010.09.005.
25. Van Oostveldt P., Verhaegen F., Messens K. Heterogeneous photobleaching in confocal microscopy caused by differences in refractive index and excitation mode. Cytometry. 1998;32:137–146.
26. Akaike H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 1974;19:716–723.
27. Hannan E.J., Quinn B.G. The determination of the order of an autoregression. J. R. Stat. Soc. B. 1979;41:190–195.
28. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464.
29. Priestley M.B. Elsevier; London: 2004. Spectral Analysis and Time Series.
30. Bishop C. Springer; New York: 2006. Pattern Recognition and Machine Learning.
31. Loeff L., Kerssemakers J.W.J., Dekker C. AutoStepfinder: a fast and automated step detection method for single-molecule analysis. Patterns (N Y) 2021;2:100256. doi: 10.1016/j.patter.2021.100256.
32. Van Rijsbergen C.J. Second Edition. Butterworths; London: 1979. Information Retrieval.
33. Goldschen-Ohm M.P., Klenchin V.A., Chanda B. Structure and dynamics underlying elementary ligand binding events in human pacemaking channels. eLife. 2016;5:e20797. doi: 10.7554/eLife.20797.
34. Stennett E.M.S., Ciuba M.A., Levitus M. Demystifying PIFE: the photophysics behind the protein-induced fluorescence enhancement phenomenon in Cy3. J. Phys. Chem. Lett. 2015;6:1819–1823. doi: 10.1021/acs.jpclett.5b00613.
35. Vlijm R., Smitshuijzen J.S.J., Dekker C. NAP1-assisted nucleosome assembly on DNA measured in real time by single-molecule magnetic tweezers. PLoS One. 2012;7:e46306. doi: 10.1371/journal.pone.0046306.
36. Isojima H., Iino R., Tomishige M. Direct observation of intermediate states during the stepping motion of kinesin-1. Nat. Chem. Biol. 2016;12:290–297. doi: 10.1038/nchembio.2028.
37. Yoo J., Kim J.-Y., Chung H.S. Fast three-color single-molecule FRET using statistical inference. Nat. Commun. 2020;11:3336. doi: 10.1038/s41467-020-17149-w.
38. Tang J., Sun Y., Han K.Y. Spatially encoded fast single-molecule fluorescence spectroscopy with full field-of-view. Sci. Rep. 2017;7:10945. doi: 10.1038/s41598-017-10837-6.

Data Availability Statement

Software implementing AutoDISC is available at https://github.com/marcel-goldschen-ohm/AutoDISC.

[bib1] 1.Chen J., Dalal R.V., Puglisi J.D. High-throughput platform for real-time monitoring of biological processes by multicolor single-molecule fluorescence. Proc. Natl. Acad. Sci. USA. 2014;111:664–669. doi: 10.1073/pnas.1315735111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Soltani M., Lin J., Wang M.D. Nanophotonic trapping for precise manipulation of biomolecular arrays. Nat. Nanotechnol. 2014;9:448–452. doi: 10.1038/nnano.2014.79. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Yan R., Moon S., Xu K. Spectrally resolved and functional super-resolution microscopy via ultrahigh-throughput single-molecule spectroscopy. Acc. Chem. Res. 2018;51:697–705. doi: 10.1021/acs.accounts.7b00545. [DOI] [PubMed] [Google Scholar]

[bib4] 4.Altman R.B., Terry D.S., Blanchard S.C. Cyanine fluorophore derivatives with enhanced photostability. Nat. Methods. 2011;9:68–71. doi: 10.1038/nmeth.1774. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Halabi E.A., Pinotsi D., Rivera-Fuentes P. Photoregulated fluxional fluorophores for live-cell super-resolution microscopy with no apparent photobleaching. Nat. Commun. 2019;10:1232. doi: 10.1038/s41467-019-09217-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Juette M.F., Terry D.S., Blanchard S.C. Single-molecule imaging of non-equilibrium molecular ensembles on the millisecond timescale. Nat. Methods. 2016;13:341–344. doi: 10.1038/nmeth.3769. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Miller H., Zhou Z., Leake M.C. Single-molecule techniques in biophysics: a review of the progress in methods and applications. Rep. Prog. Phys. 2018;81:024601. doi: 10.1088/1361-6633/aa8a02. [DOI] [PubMed] [Google Scholar]

[bib8] 8.Zhou X., Wong S.T.C. Informatics challenges of high-throughput microscopy. IEEE Signal Process. Mag. 2006;23:63–72. [Google Scholar]

[bib9] 9.Blanco M., Walter N.G. Analysis of complex single-molecule FRET time trajectories. Methods Enzymol. 2010;472:153–178. doi: 10.1016/S0076-6879(10)72011-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Bronson J.E., Fei J., Wiggins C.H. Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys. J. 2009;97:3196–3205. doi: 10.1016/j.bpj.2009.09.031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.McKinney S.A., Joo C., Ha T. Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J. 2006;91:1941–1951. doi: 10.1529/biophysj.106.082487. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Blanco M.R., Johnson-Buck A.E., Walter N.G. In: Encyclopedia of Biophysics. Roberts G.C.K., editor. Springer; 2013. Hidden Markov modeling in single-molecule biophysics; pp. 971–975. [Google Scholar]

[bib13] 13.White D.S., Goldschen-Ohm M.P., Chanda B. Top-down machine learning approach for high-throughput single-molecule analysis. eLife. 2020;9:e53357. doi: 10.7554/eLife.53357. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14. Dempsey G.T., Bates M., Zhuang X. Photoswitching mechanism of cyanine dyes. J. Am. Chem. Soc. 2009;131:18192–18193. doi: 10.1021/ja904588g.

[bib15] 15. Levene M.J., Korlach J., Webb W.W. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 2003;299:682–686. doi: 10.1126/science.1079700.

[bib16] 16. Greenfeld M., Pavlichin D.S., Herschlag D. Single Molecule Analysis Research Tool (SMART): an integrated approach for analyzing single molecule data. PLoS One. 2012;7:e30024. doi: 10.1371/journal.pone.0030024.

[bib17] 17. Hines K.E., Bankston J.R., Aldrich R.W. Analyzing single-molecule time series via nonparametric Bayesian inference. Biophys. J. 2015;108:540–556. doi: 10.1016/j.bpj.2014.12.016.

[bib18] 18. Sgouralis I., Pressé S. An introduction to infinite HMMs for single-molecule data analysis. Biophys. J. 2017;112:2021–2029. doi: 10.1016/j.bpj.2017.04.027.

[bib19] 19. Celik N., O’Brien F., Barrett-Jolley R. Deep-channel uses deep neural networks to detect single-molecule events from patch-clamp data. Commun. Biol. 2020;3:3. doi: 10.1038/s42003-019-0729-3.

[bib20] 20. Li J., Zhang L., Walter N.G. Automatic classification and segmentation of single-molecule fluorescence time traces with deep learning. Nat. Commun. 2020;11:5833. doi: 10.1038/s41467-020-19673-1.

[bib21] 21. Xu J., Qin G., Fang X. Automated stoichiometry analysis of single-molecule fluorescence imaging traces via deep learning. J. Am. Chem. Soc. 2019;141:6976–6985. doi: 10.1021/jacs.9b00688.

[bib22] 22. Kadane J.B., Lazar N.A. Methods and criteria for model selection. J. Am. Stat. Assoc. 2004;99:279–290.

[bib23] 23. Shuang B., Cooper D., Landes C.F. Fast step transition and state identification (STaSI) for discrete single-molecule data analysis. J. Phys. Chem. Lett. 2014;5:3157–3161. doi: 10.1021/jz501435p.

[bib24] 24. Holden S.J., Uphoff S., Kapanidis A.N. Defining the limits of single-molecule FRET resolution in TIRF microscopy. Biophys. J. 2010;99:3102–3111. doi: 10.1016/j.bpj.2010.09.005.

[bib25] 25. Van Oostveldt P., Verhaegen F., Messens K. Heterogeneous photobleaching in confocal microscopy caused by differences in refractive index and excitation mode. Cytometry. 1998;32:137–146.

[bib26] 26. Akaike H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 1974;19:716–723.

[bib27] 27. Hannan E.J., Quinn B.G. The determination of the order of an autoregression. J. R. Stat. Soc. B. 1979;41:190–195.

[bib28] 28. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464.

[bib29] 29. Priestley M.B. Elsevier; London: 2004. Spectral Analysis and Time Series.

[bib30] 30. Bishop C. Springer; New York: 2006. Pattern Recognition and Machine Learning.

[bib31] 31. Loeff L., Kerssemakers J.W.J., Dekker C. AutoStepfinder: a fast and automated step detection method for single-molecule analysis. Patterns (N Y) 2021;2:100256. doi: 10.1016/j.patter.2021.100256.

[bib32] 32. Van Rijsbergen C.J. Second Edition. Butterworths; London: 1979. Information Retrieval.

[bib33] 33. Goldschen-Ohm M.P., Klenchin V.A., Chanda B. Structure and dynamics underlying elementary ligand binding events in human pacemaking channels. eLife. 2016;5:e20797. doi: 10.7554/eLife.20797.

[bib34] 34. Stennett E.M.S., Ciuba M.A., Levitus M. Demystifying PIFE: the photophysics behind the protein-induced fluorescence enhancement phenomenon in Cy3. J. Phys. Chem. Lett. 2015;6:1819–1823. doi: 10.1021/acs.jpclett.5b00613.

[bib35] 35. Vlijm R., Smitshuijzen J.S.J., Dekker C. NAP1-assisted nucleosome assembly on DNA measured in real time by single-molecule magnetic tweezers. PLoS One. 2012;7:e46306. doi: 10.1371/journal.pone.0046306.

[bib36] 36. Isojima H., Iino R., Tomishige M. Direct observation of intermediate states during the stepping motion of kinesin-1. Nat. Chem. Biol. 2016;12:290–297. doi: 10.1038/nchembio.2028.

[bib37] 37. Yoo J., Kim J.-Y., Chung H.S. Fast three-color single-molecule FRET using statistical inference. Nat. Commun. 2020;11:3336. doi: 10.1038/s41467-020-17149-w.

[bib38] 38. Tang J., Sun Y., Han K.Y. Spatially encoded fast single-molecule fluorescence spectroscopy with full field-of-view. Sci. Rep. 2017;7:10945. doi: 10.1038/s41598-017-10837-6.
