Abstract
Colocalization single-molecule methods can provide a wealth of information concerning the ordering and dynamics of biomolecule assembly. These have been used extensively to study the pathways of spliceosome assembly in vitro. Key to these experiments is the measurement of binding times—either the dwell times of a multi-molecular interaction or times in between binding events. By analyzing hundreds of these times, many new insights into the kinetic pathways governing spliceosome assembly have been obtained. Collections of binding times are often plotted as histograms and can be fit to kinetic models using a variety of methods. Here, we describe the use of maximum likelihood methods to fit dwell time distributions without binning. In addition, we discuss several aspects of analyzing these distributions with histograms and pitfalls that can be encountered if improperly binned histograms are used. We have automated several aspects of maximum likelihood fitting of dwell time distributions in the AGATHA software package.
Keywords: single-molecule, fluorescence, spliceosome, dynamics, software, fitting
1. Introduction
The spliceosome is an extremely complex and highly dynamic molecular machine found in eukaryotes [1]. It carries out precursor mRNA (pre-mRNA) splicing by concerted removal of intronic sequences and ligation of the flanking exons. The splicing process requires the coordinated action of five small nuclear ribonucleoprotein particles (snRNPs): U1, U2, U4, U5 and U6. Each snRNP contains a uridine-rich small nuclear RNA (U snRNA) and several snRNP-specific proteins [2]. In addition to large-scale conformational rearrangements of the snRNPs, numerous other splicing factors assemble, rearrange and/or dissociate from the spliceosome during each step of splicing [2–5]. Single-molecule fluorescence microscopy methods such as single-molecule FRET (smFRET) and colocalization single-molecule spectroscopy (CoSMoS) have revealed transient behaviors of the spliceosome that are often obscured by ensemble techniques. In fact, splicing was first discovered through single-molecule imaging of RNA/DNA hybrids using electron microscopy [6, 7]. Recent high-resolution cryo-EM structures have revealed the overall structure and detailed inner workings of several key states of the spliceosome [4–6]. The structural rearrangements observed in these different states have revolutionized our understanding of the splicing mechanism and have validated key single-molecule results concerning juxtaposition of the sites of splicing chemistry prior to 5’ splice site cleavage [8–11].
In addition to pre-mRNA splicing, CoSMoS and other colocalization approaches have been used to study many other multistep biochemical processes including transcription, translation, DNA replication, and actin filament branching [12–18]. In general, colocalization experiments involve observation of the binding and release of fluorescent molecules from a surface-tethered substrate. Often this is enabled by the use of spectrally distinguishable fluorophores (e.g., Cy3 and Cy5), which can be individually excited and detected [15]. This has allowed multiple fluorescent species to be followed simultaneously, providing unique insights into biomolecular assembly and disassembly pathways. Early work on the S. cerevisiae (yeast) splicing machinery revealed that spliceosomes assemble on pre-mRNA in a partially ordered pathway with multiple reversible steps, potentially identifying points of regulation [19, 20]. Critically, these experiments also revealed quantitative kinetic information about several discrete steps in splicing—something which was not possible using earlier approaches such as native gel electrophoresis of cellular splicing extracts.
In this article, we discuss and compare statistical methods that are used to obtain the fit parameters associated with CoSMoS data of spliceosome assembly. We also introduce the A GATHering of Analyses (AGATHA) software package that we have developed to facilitate maximum likelihood fitting of single-molecule data and its statistical analysis. We illustrate the use of AGATHA in fitting data related to assembly of splicing factors on RNAs; however, these maximum likelihood methods are generally useful and can be used to analyze single molecule data originating from many different types of experiments beyond pre-mRNA splicing.
2. Example Data and Initial Analysis
2.1. RNA Binding Dynamics of a Yeast Splicing Factor
To demonstrate the methods used in statistical analysis of binding times obtained from single-molecule experiments, we will use two recently published data sets describing the binding of the yeast splicing factor branchpoint bridging protein (BBP) to pre-mRNA substrates containing or lacking the branch site (BS) [21]. In these experiments, Larson et al. showed that the presence of a BS promotes longer binding of a fluorescently tagged BBP molecule to a surface-immobilized RNA. CoSMoS experiments were performed using a custom-built micromirror TIRF microscope in which the laser excitation beams enter and exit through the objective. The workflow for constructing this microscope has already been published [22]. Pre-mRNAs, labeled with a red laser-excited Cy5 fluorophore, were first immobilized on a functionalized glass slide. Whole cell extract containing BBP protein labeled with a green laser-excited Dy549 fluorophore was then added. This experimental set-up for two-color CoSMoS is schematically illustrated in Figure 1A. Individual fluorophores were visualized as discrete spots of intensity, allowing the locations of the RNA and splicing factors to be determined. Images were then recorded from the camera over time, creating movies of “red” immobilized RNAs and “green” dynamic BBP proteins. Detailed descriptions of the experimental set-up and data collection can be found elsewhere [19, 21–26].
2.2. Obtaining a List of Dwell Times from Movies of Single Molecules
In the above experiments with BBP, the fluorescence signal from the surface-tethered pre-mRNAs was used to define areas of interest (AOIs). AOIs were then mapped from the >635 nm field of view (FOV) corresponding to the “red” pre-mRNA locations to the <635 nm FOV in which the “green” BBP was imaged [25]. This was followed by pixel intensity integration over each AOI, which produced a BBP fluorescence intensity trajectory at each pre-mRNA location (Figure 1B). In this example, the peaks in fluorescence intensity were identified by changes in signal that exceeded a threshold value of 3.2σs, where σs represents the baseline noise of the fluorescence trajectory. In effect, the association/dissociation of BBP on an individual RNA corresponds to the appearance/disappearance of fluorescence peaks from the AOI. The details of the mapping and spot discrimination methods that can be used to obtain the fluorescence intensity trajectories have been previously described [25].
Often a single AOI will show multiple binding events (cf. Figure 1B), and each binding event is characterized by its own binding or dwell time. The dwell times observed will depend on the biochemical properties of the system studied. For example, inspection of individual fluorescence trajectories of BBP binding to a pre-mRNA containing a BS reveals both short and long events (Figure 1B). However, when a pre-mRNA lacks a BS, fluorescence trajectories of BBP binding reveal primarily short events (Figure 1C). This is expected since BBP should most strongly associate with RNAs containing the 5’-UACUAAC-3’ BS sequence [27].
2.3. Plotting the Single-Molecule Data as a Distribution of Dwell Times
A single CoSMoS experiment can yield hundreds of dwell times derived from many different binding events occurring on many different molecules. It is often beneficial to first plot the dwell time distribution as a probability density (PD) histogram. In this method, dwell times are first binned, and the population in each bin (Nbin) is then divided by the product of the bin width (w) and total number of events [Ntot; PD = Nbin/(w × Ntot)]. The probability density histograms of dwell times for BBP on RNAs with or without a BS are compared in Figure 1D. The dwell time distribution for BBP binding on RNA that lacks a BS (dark green) is narrower (shifted towards shorter dwell times) than that obtained from BBP binding to RNA containing a BS (light green). This arises due to the scarcity of long-lived binding events in the absence of the BS. The simplest binding mechanism of BBP on pre-mRNA (R) without a BS can be described as a single-step process:
$$\mathrm{BBP} + \mathrm{R} \rightleftharpoons \mathrm{BBP{\cdot}R} \tag{1}$$
In contrast, the broader distribution of BBP dwell times on the wild-type RNA could be due to the presence of two or more populations of BBP-RNA complexes.
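The probability density normalization described above, PD = Nbin/(w × Ntot), can be sketched in a few lines of Python. This is a minimal illustration and not the AGATHA implementation; the function name `density_histogram`, the dwell times, and the bin edges are all hypothetical. Because each bin is divided by its own width, bins of unequal width remain comparable, and the histogram integrates to one when every event falls inside the binned range.

```python
from bisect import bisect_right

def density_histogram(dwell_times, edges):
    """Return PD = N_bin / (w * N_tot) for each bin defined by ascending edges."""
    counts = [0] * (len(edges) - 1)
    for t in dwell_times:
        if edges[0] <= t < edges[-1]:
            counts[bisect_right(edges, t) - 1] += 1
    n_tot = len(dwell_times)
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return [c / (w * n_tot) for c, w in zip(counts, widths)]

# Hypothetical dwell times in seconds, for illustration only
times = [1.2, 2.5, 3.1, 4.8, 7.7, 9.0, 15.4, 22.0, 41.3, 88.6]
edges = [0, 5, 10, 25, 100]          # unequal bin widths

density = density_histogram(times, edges)
# Area under the histogram: sum of (density * bin width)
area = sum(p * (hi - lo) for p, lo, hi in zip(density, edges[:-1], edges[1:]))
```

Checking that `area` is close to 1 is a quick sanity test that no events were dropped by the chosen bin edges.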
A more quantitative and theoretical analysis of the dwell time distributions can provide additional information about kinetic features of the BBP-RNA complexes. The probability density function (PDF) for the lifetime in an individual state can be described as an exponential distribution [28]. For mechanisms with multiple states, the probability density function is the sum of the exponential distributions [28]. A general expression for PDF with k states can be written as:
$$P(t) = \sum_{i=1}^{k} \frac{a_i}{\tau_i}\, e^{-t/\tau_i} \tag{2}$$
where τi and ai are the time constant and relative amplitude of the ith state, respectively, and the amplitudes satisfy the constraint ∑ai = 1. It is of significant interest to know the characteristic time constants, τi, for each complex as they provide information about the interconversion of the complexes and their relative kinetic stabilities. The values of these time constants can be extracted by fitting an appropriate equation to the measured data as discussed below.
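As a rough illustration of a sum-of-exponentials density with amplitudes that add up to one, the sketch below evaluates such a PDF and checks numerically that it integrates to one. The function names and the trapezoid integrator are hypothetical helpers, and the parameter values (a1 = 0.75, τ1 = 10 s, τ2 = 100 s) are illustrative rather than fitted.

```python
import math

def multi_exp_pdf(t, amplitudes, taus):
    """Sum of exponential densities: sum_i (a_i / tau_i) * exp(-t / tau_i)."""
    if abs(sum(amplitudes) - 1.0) > 1e-9:
        raise ValueError("amplitudes must sum to 1")
    return sum(a / tau * math.exp(-t / tau) for a, tau in zip(amplitudes, taus))

def integrate(f, lo, hi, steps=100_000):
    """Trapezoid rule, used here only to check normalization."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h

# Illustrative two-state parameters (hypothetical values)
amps, taus = [0.75, 0.25], [10.0, 100.0]
area = integrate(lambda t: multi_exp_pdf(t, amps, taus), 0.0, 2000.0)

# The mean dwell time of the mixture is the amplitude-weighted mean of the taus
mean_dwell = sum(a * tau for a, tau in zip(amps, taus))
```

Note that the amplitudes weight the populations, not the heights of the curve at t = 0: each term contributes ai/τi at the origin, so a long-lived minor population can be nearly invisible near t = 0 yet dominate the tail.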
3. Methods for Fitting Distributions of Dwell Times
3.1. Obtaining the Fit Parameters and Associated Errors
The method of least squares is frequently used to estimate the best fit parameters. Although this approach is straightforward and powerful, it can have its pitfalls if not used carefully [29–32]. This is particularly apparent when used to fit data which are not normally distributed. An alternative approach is the Maximum Likelihood (ML) estimation [33, 34]. For a sufficiently large dataset, different methods should ideally yield the same estimates for the fit parameters. However, in practice, the extracted fit parameters can often depend on the chosen method. This will be illustrated in Section 3.3 by comparing the fit results obtained from two independent methods. For simplicity, we will focus the discussion below on fitting and error estimates of kinetic parameters using the ML approach since it is likely less familiar to most biochemists.
Using Equation (2), the probability density for observing the first data point, t1, reads as
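$$P(t_1) = \sum_{i=1}^{k} \frac{a_i}{\tau_i}\, e^{-t_1/\tau_i} \tag{3}$$
Because the measurement of one dwell time is independent of any other dwell time observation within an experiment, the probability density for observing all n measured data points, t1, t2, …, tn, can be written as a product of the individual probability densities. This total probability density defines the likelihood function (Lik (τi, ai)):
$$\mathrm{Lik}(\tau_i, a_i) = \prod_{j=1}^{n} P(t_j) = \prod_{j=1}^{n} \sum_{i=1}^{k} \frac{a_i}{\tau_i}\, e^{-t_j/\tau_i} \tag{4}$$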
In other words, the likelihood function characterizes the probability to observe a particular set of dwell time values obtained from an experiment. Maximizing the function, Lik (τi, ai), with respect to the parameters τi and ai will make the observed data most probable. Hence, the values of τi and ai that yield a global maximum of Lik (τi, ai), are the best fit parameters of the PDF to the experimentally observed distribution.
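It is important to note that the experimental conditions set limits on the measured dwell times (t), tm ≤ t ≤ tx, such that nothing shorter than tm can be measured in an experiment of duration tx. The parameter tm is often limited by the camera frame rate. These constraints on the dwell times call for a conditional PDF instead of Equation (2), which for a single exponential distribution can be defined as
$$P(t \mid t_m \le t \le t_x) = \frac{\frac{1}{\tau}\, e^{-t/\tau}}{e^{-t_m/\tau} - e^{-t_x/\tau}} \tag{5}$$
Similarly, one could obtain the conditional PDF for a bi-exponential distribution,
$$P(t \mid t_m \le t \le t_x) = \frac{\frac{a_1}{\tau_1}\, e^{-t/\tau_1} + \frac{a_2}{\tau_2}\, e^{-t/\tau_2}}{a_1\left(e^{-t_m/\tau_1} - e^{-t_x/\tau_1}\right) + a_2\left(e^{-t_m/\tau_2} - e^{-t_x/\tau_2}\right)} \tag{6}$$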
with a1 + a2 = 1.
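The conditional PDFs can be sketched as an ordinary multi-exponential density divided by the probability that a dwell time falls inside the observable window tm ≤ t ≤ tx. The code below is a minimal sketch under that assumption; the function name, the limits (1 s and 300 s), and the parameter values are hypothetical. The numerical integral confirms that the renormalized density integrates to one over the window.

```python
import math

def truncated_multi_exp_pdf(t, amplitudes, taus, t_min, t_max):
    """Multi-exponential density renormalized to the window t_min <= t <= t_max."""
    if not t_min <= t <= t_max:
        return 0.0
    numerator = sum(a / tau * math.exp(-t / tau) for a, tau in zip(amplitudes, taus))
    # Probability that an unconstrained dwell time falls inside the window
    window = sum(a * (math.exp(-t_min / tau) - math.exp(-t_max / tau))
                 for a, tau in zip(amplitudes, taus))
    return numerator / window

# Hypothetical limits: 1 s shortest resolvable event, 300 s experiment
t_min, t_max = 1.0, 300.0
amps, taus = [0.75, 0.25], [10.0, 100.0]

def pdf(t):
    return truncated_multi_exp_pdf(t, amps, taus, t_min, t_max)

# Trapezoid-rule check that the conditional density integrates to one
steps = 60_000
h = (t_max - t_min) / steps
area = h * (0.5 * (pdf(t_min) + pdf(t_max))
            + sum(pdf(t_min + i * h) for i in range(1, steps)))
```

With a single amplitude of 1 and a single time constant, the same function reduces to the single exponential case.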
To obtain the best fit of Equation (5) to the dwell time distribution of BBP on RNA without a BS (Figure 1D), we maximize the logarithmic likelihood function:
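$$L(\tau) = \ln \mathrm{Lik}(\tau) = \sum_{j=1}^{n} \ln\left[\frac{\frac{1}{\tau}\, e^{-t_j/\tau}}{e^{-t_m/\tau} - e^{-t_x/\tau}}\right] \tag{7}$$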
Optimizing the product of the probabilities (Equation 4) is often computationally inefficient since this product can yield a very small number. With increasing number of data points, this product can run out of precision very quickly due to the floating-point arithmetic used by computers. Therefore, it is better to maximize the log of the likelihood function as it converts the product of the individual probability densities to summation and preserves the fitting results.
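The loss of precision is easy to reproduce. In the sketch below, 288 hypothetical dwell times (all set to 8.6 s for simplicity) are scored against a single exponential with τ = 8.6 s. The running product of densities underflows to exactly zero in double precision, whereas the sum of log densities remains an ordinary finite number. For the same reason, a log-likelihood of about −900 cannot be converted back to a likelihood: exp(−909.6) is smaller than the smallest positive double.

```python
import math

tau = 8.6
dwell_times = [8.6] * 288   # hypothetical: 288 identical dwell times (s)

# Direct product of probability densities (underflows)
likelihood = 1.0
for t in dwell_times:
    likelihood *= math.exp(-t / tau) / tau

# Sum of log densities (stays well within floating-point range)
log_likelihood = sum(-math.log(tau) - t / tau for t in dwell_times)

underflowed = (likelihood == 0.0)
```

Once the product has collapsed to zero, every parameter value looks equally bad to an optimizer, which is why the log form is the one that is maximized in practice.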
Figure 2A shows a plot of L(τ) versus τ, in which L(τ) reaches a maximum value of −909.6 at τmax = 8.6 s. This τmax value is the ML estimate for the fit parameter τ for BBP on RNA without a BS. In other words, this parameter indicates that BBP has a characteristic dwell time of 8.6 s when associating with RNAs lacking a BS sequence.
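A one-parameter maximization of this kind can be sketched with a simple golden-section search. The example below is only an illustration on simulated data (not the BBP data set), and it assumes the single exponential density is renormalized over the window tm ≤ t ≤ tx; the function names, random seed, and window limits are hypothetical. AGATHA itself is MATLAB-based and may use a different optimizer.

```python
import math
import random

def log_likelihood(tau, times, t_min, t_max):
    """Log-likelihood of a single exponential truncated to [t_min, t_max]."""
    window = math.exp(-t_min / tau) - math.exp(-t_max / tau)
    return (sum(-math.log(tau) - t / tau for t in times)
            - len(times) * math.log(window))

def golden_section_max(f, lo, hi, tol=1e-6):
    """Locate the maximum of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if f(c) > f(d):
            hi = d      # maximum lies in [lo, d]
        else:
            lo = c      # maximum lies in [c, hi]
    return 0.5 * (lo + hi)

# Simulated data: tau = 8.6 s, observable window 0.2 s to 600 s
rng = random.Random(1)
t_min, t_max, true_tau = 0.2, 600.0, 8.6
times = []
while len(times) < 2000:
    t = rng.expovariate(1.0 / true_tau)
    if t_min <= t <= t_max:
        times.append(t)

tau_hat = golden_section_max(
    lambda tau: log_likelihood(tau, times, t_min, t_max), 1.0, 100.0)
```

Golden-section search only works for a single parameter with one maximum in the bracket; multi-exponential fits need a multidimensional optimizer and reasonable starting guesses.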
Similarly, one could obtain the ML estimates for a1, τ1, a2, and τ2 of the double exponential PDF [Equation (6)], which is useful for describing the dwell time data set of BBP on WT RNA. In this case, the more complicated equation is necessary to correctly fit both the long and short dwell times that appear in the data set when BBP binds RNAs containing a BS sequence. A contour plot of the logarithmic likelihood function, L(τ1,τ2) [corresponding to the double exponential PDF, Equation (6)], is plotted as a function of τ1 and τ2 while holding a1 constant (Figure 2B). L(τ1,τ2) reaches a maximum value of −1639.5 at τ1 = 12.9 s and τ2 = 119.3 s with the ML estimate for a1 = 0.74.
Apart from estimating the optimized fit parameters, it is equally important to quantify the errors associated with them. There are many possible ways to estimate these errors. A standard approach is to obtain the standard deviations of the parameter estimates from the square roots of the diagonal elements of the covariance matrix of Lik (θi) with respect to the fit variables, θi [35]. Here, the covariance matrix can be written as C (θ) = I (θ)⁻¹, where
$$I_{ij}(\theta) = -\left.\frac{\partial^2 \ln \mathrm{Lik}(\theta)}{\partial\theta_i\,\partial\theta_j}\right|_{\theta_i = \theta_i^{max},\ \theta_j = \theta_j^{max}} \tag{8}$$
θimax and θjmax are the ML estimates for θi and θj, respectively. For a single exponential distribution, it is straightforward from Equations (5) and (8) to obtain an analytical expression for the standard deviation, σ ≈ τmax/√n, where τmax is the ML estimate of τ and n is the total number of dwell times. With a total of 288 binding events/dwell times and τmax = 8.6 s (data corresponding to Figure 2A), the standard deviation is ~0.5 s. It is more difficult to obtain analytical expressions for the standard deviations associated with all parameters of higher order exponential distributions. As a result, one can approach these problems using numerical analysis.
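The ~0.5 s value can be checked with a short calculation. If the limits tm and tx are neglected (an assumption made here only to keep the algebra simple), the second derivative of the single exponential log-likelihood at its maximum is −n/τmax², so the standard deviation is τmax/√n, or 8.6 s/√288 ≈ 0.51 s. The sketch below compares this analytical value with a finite-difference estimate of the curvature, which is the numerical route one would take for higher order distributions. The dwell times are hypothetical (evenly spaced quantiles of an exponential distribution), and the function names are illustrative.

```python
import math

def log_likelihood(tau, times):
    """Untruncated single exponential log-likelihood (limits t_m, t_x ignored)."""
    return math.fsum(-math.log(tau) - t / tau for t in times)

def sigma_from_curvature(times, tau_max, h=1e-2):
    """Standard deviation from a finite-difference second derivative at tau_max."""
    d2 = (log_likelihood(tau_max + h, times)
          - 2 * log_likelihood(tau_max, times)
          + log_likelihood(tau_max - h, times)) / h ** 2
    return math.sqrt(-1.0 / d2)

# Hypothetical dwell times: evenly spaced quantiles of an exponential, tau = 8.6 s
n = 288
times = [-8.6 * math.log(1 - (i + 0.5) / n) for i in range(n)]

tau_max = sum(times) / n            # without truncation the ML estimate is the mean
sigma_numeric = sigma_from_curvature(times, tau_max)
sigma_analytic = tau_max / math.sqrt(n)
```

For several parameters the same idea applies, except that the full matrix of second derivatives must be estimated and inverted before taking square roots of its diagonal.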
Another way of estimating the error in fit parameters is by finding likelihood intervals. The likelihood intervals (i.e., the ranges for the fit parameters) are the values most probable within certain neighborhoods around the maxima [29]. For example, consider the horizontal line L(τmax) − m plotted together with the likelihood curve. The points of intersection of the line and the curve, τlow and τhigh, provide a good estimate for the uncertainty in τmax (Figure 2A). The error estimate, in this particular case, depends solely on the value of m. The likelihood intervals for m = 0.5 and m = 2 correspond to one and two standard deviation limits, respectively [35]. For higher order exponential distributions, a similar procedure can be employed by estimating the error on one parameter while keeping the other parameters constant. Likelihood interval estimates for a1, τ1 and τ2 are shown in Table 1 for a distribution containing two exponential terms. Likelihood interval estimates are relatively easy to obtain for a single exponential fit but can become laborious with increasing numbers of variables.
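Finding τlow and τhigh amounts to solving L(τ) = L(τmax) − m on either side of the maximum, which can be sketched with bisection. The example below assumes an untruncated single exponential, for which the log-likelihood depends on the data only through n and the mean dwell time; the inputs n = 288 and a mean of 8.6 s are used purely for illustration, so the results only approximate the intervals in Table 1. All function names are hypothetical.

```python
import math

def log_likelihood(tau, n, mean_dwell):
    """Untruncated single exponential log-likelihood from summary statistics."""
    return -n * (math.log(tau) + mean_dwell / tau)

def bisect_root(f, lo, hi, iterations=100):
    """Bisection for a sign change of f between lo and hi."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def likelihood_interval(n, mean_dwell, m):
    """Return (tau_low, tau_high) where L drops by m below its maximum."""
    tau_max = mean_dwell
    target = log_likelihood(tau_max, n, mean_dwell) - m

    def g(tau):
        return log_likelihood(tau, n, mean_dwell) - target

    tau_low = bisect_root(g, 0.2 * tau_max, tau_max)
    tau_high = bisect_root(g, tau_max, 5.0 * tau_max)
    return tau_low, tau_high

low_68, high_68 = likelihood_interval(288, 8.6, 0.5)
low_95, high_95 = likelihood_interval(288, 8.6, 2.0)
```

Unlike the curvature-based standard deviation, these intervals need not be symmetric about τmax, because the likelihood curve falls off more slowly on the high-τ side.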
Table 1.
Each interval is given as lower bound–upper bound, with the deviations from the estimate in parentheses.

| RNA | Function | Parameter | ML estimate | Likelihood interval, m = 0.5 (68%) | Likelihood interval, m = 2 (95%) | Bootstrap mean | Bootstrap confidence interval, σ (68%) | Bootstrap confidence interval, 2σ (95%) |
|---|---|---|---|---|---|---|---|---|
| Without BS | Single | τ (s) | 8.6 | 8.1–9.1 (−0.5/+0.5) | 7.6–9.6 (−0.9/+1.1) | 8.6 | 7.9–9.2 (−0.7/+0.7) | 7.2–10.0 (−1.4/+1.4) |
| WT RNA | Double | a1 | 0.74 | 0.70–0.77 (−0.03/+0.03) | 0.67–0.79 (−0.06/+0.06) | 0.74 | 0.69–0.78 (−0.04/+0.04) | 0.65–0.82 (−0.08/+0.08) |
| | | τ1 (s) | 12.9 | 11.9–13.9 (−1.0/+1.0) | 10.9–15.2 (−2.0/+2.3) | 12.9 | 11.6–14.2 (−1.3/+1.3) | 10.3–15.5 (−2.6/+2.6) |
| | | τ2 (s) | 119.4 | 107.2–133.9 (−12.2/+14.6) | 96.5–151.1 (−22.9/+31.6) | 120.9 | 104.4–137.4 (−16.5/+16.5) | 87.8–154.0 (−33.1/+33.1) |
In many cases, the statistical method of bootstrapping is advantageous over the aforementioned methods in estimating the errors of the fit parameters [36]. Bootstrapping is a resampling method in which a new data set is generated from the observed data by random sampling, with the new and original data sets being of the same size. Ideally, this resampling method preserves the actual distribution of the parameters present in the observed data set. An example of the bootstrap analysis is illustrated in Figure 2C, where 1000 data sets were simulated from the dwell times for BBP on RNA without a BS. The ML estimates for τ were obtained for all 1000 data sets. The distribution of ML estimates for τ was analyzed by plotting a probability density histogram and then fitting to a Gaussian distribution. The Gaussian fit yields a mean value of 8.6 s and standard deviation of 0.7 s for τ, which are comparable to the ML estimate and 0.5-unit likelihood intervals (Figure 2A). In a similar fashion, one could obtain the uncertainty in the estimates for a large number of parameters in a fit. A direct comparison of the error estimates for fit parameters obtained from the likelihood intervals, and the bootstrap analysis can be found in Table 1.
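The resampling procedure can be sketched as follows. For brevity, this example refits each resampled data set with the ML estimate of an untruncated single exponential (the sample mean) as a stand-in for a full fit with the limits tm and tx; the "observed" data are simulated with τ = 8.6 s, and the function names and seeds are hypothetical.

```python
import random
import statistics

def ml_tau(times):
    """ML estimate of tau for an untruncated single exponential (the sample mean)."""
    return sum(times) / len(times)

def bootstrap_tau(times, n_boot=1000, seed=0):
    """Resample with replacement n_boot times and refit each resampled data set."""
    rng = random.Random(seed)
    n = len(times)
    return [ml_tau(rng.choices(times, k=n)) for _ in range(n_boot)]

# Simulated stand-in for an observed data set: 288 dwell times with tau = 8.6 s
rng = random.Random(42)
observed = [rng.expovariate(1.0 / 8.6) for _ in range(288)]

estimates = bootstrap_tau(observed)
boot_mean = statistics.mean(estimates)   # centre of the bootstrap distribution
boot_sd = statistics.stdev(estimates)    # bootstrap estimate of the uncertainty
```

Each resampled data set has the same size as the original, so the spread of the 1000 refitted values reflects the sampling uncertainty of a 288-event experiment.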
3.2. Determining the Goodness of the Fit
Although ML is a powerful technique, care should be taken in assessing the goodness of the fit to the unbinned data. This can be done by using statistical tests such as the likelihood ratio or Akaike Information Criterion (AIC) for model selection based on the likelihoods [37, 38]. For example, a log likelihood ratio test can identify if the dwell time distribution for BBP association with WT RNA is better described by a single or double exponential PDF. The MATLAB function lratiotest efficiently implements this procedure and, in this example, results in rejection of the model based on a single exponential PDF. For fitting of data sets with unknown kinetic features, it is often advisable to begin by fitting to a single exponential PDF. The log likelihood ratio test or AIC can then be used to test if the simplest model is sufficient or if more complicated PDFs are needed to model the data. Figure 2D shows good agreement between the data and the fit curves for BBP dwell times on RNAs with and without a BS.
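Both criteria need only the maximized log-likelihoods and the number of free parameters. The sketch below uses hypothetical log-likelihood values (they are not taken from the BBP fits). Going from a single to a double exponential adds two free parameters (one amplitude and one time constant), and for two degrees of freedom the chi-square tail probability has the closed form exp(−x/2). The chi-square reference distribution should be regarded as an approximation here, because the single exponential lies on the boundary of the double exponential model.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: lower values indicate the preferred model."""
    return 2 * n_params - 2 * log_lik

def likelihood_ratio_test_df2(log_lik_simple, log_lik_complex):
    """Likelihood ratio statistic and chi-square p-value for 2 extra parameters."""
    statistic = max(0.0, 2.0 * (log_lik_complex - log_lik_simple))
    p_value = math.exp(-statistic / 2.0)   # chi-square survival function, 2 d.o.f.
    return statistic, p_value

# Hypothetical maximized log-likelihoods for the same data set
log_lik_single = -1700.0   # single exponential, 1 free parameter (tau)
log_lik_double = -1650.0   # double exponential, 3 free parameters (a1, tau1, tau2)

statistic, p_value = likelihood_ratio_test_df2(log_lik_single, log_lik_double)
aic_single = aic(log_lik_single, 1)
aic_double = aic(log_lik_double, 3)
```

A small p-value or a lower AIC for the double exponential both point toward rejecting the simpler model; when the two log-likelihoods are nearly equal, the simpler model is retained.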
Critically, it is important to consider the histogram binning since one could easily bias the fit if the histogram is not binned properly. For example, we created a histogram with six bins of equal width (100 s each) for the dwell time data set of BBP binding to WT RNA along with the curve obtained using a ML fit of the unbinned data (Figure 3A). It is evident that the ML fit curve (red) deviates significantly from the equally binned histogram as well as the curve obtained from least squares fitting of the bin centers (blue line and black points). To correct this, one can construct an unequally binned histogram with narrow bin widths for shorter intervals. We have plotted the same ML curve along with unequally binned histograms of the same data set in Figures 3B and C. The agreement between the ML fit and the histogram gets better with increasing number of unequal bins.
3.3. Comparison Between Maximum Likelihood and Least Squares Fitting
The data plotted in Figure 3 also illustrate a potential pitfall of least squares fitting of dwell time distributions. In this case, the least squares fits were obtained using the curve fitting application of MATLAB (Table 2). With least squares fitting, it is possible to obtain ill-defined fit parameters with large standard deviations despite having reasonable R2 or adjusted R2 values. In this case, the least squares fitting is improved by increasing the number of bins and by using variable bin sizes. If the bin number is large, the least squares predictions for the parameters approach those obtained by ML estimates (compare parameters in Table 1 vs. Table 2). However, the least squares method results in broader confidence intervals as compared to the ML error estimates.
Table 2.
| No. of bins | Bin size | Parameter | Nonlinear least squares fit | 68% confidence interval | 95% confidence interval | R2/Adj R2 | Corresponding figure |
|---|---|---|---|---|---|---|---|
| 6 | Equal | a1 | 0.91 | −1.32 to 3.14 | −5.06 to 6.88 | 0.9465/0.9108 | 3A |
| | | τ1 (s) | 38.5 | −29.2 to 106.1 | −142.6 to 219.5 | | |
| | | τ2 (s) | 116.1 | −2399 to 2631 | −6617 to 6849 | | |
| 6 | Variable | a1 | 0.82 | 0.74 to 0.91 | 0.59 to 1.06 | 0.9996/0.9994 | 3B |
| | | τ1 (s) | 15.4 | 13.5 to 17.2 | 10.4 to 20.3 | | |
| | | τ2 (s) | 104.9 | 22.1 to 188.0 | −116.7 to 326.5 | | |
| 9 | Variable | a1 | 0.70 | 0.65 to 0.75 | 0.59 to 0.81 | 0.9992/0.9989 | 3C |
| | | τ1 (s) | 12.4 | 11.6 to 13.2 | 10.6 to 14.2 | | |
| | | τ2 (s) | 107.4 | 69.3 to 145.5 | 21.4 to 193.5 | | |
Additionally, least squares fits can be highly sensitive to user inputs for upper and lower bounds for the fit coefficients as well as to sample size. To see the effect of the latter, we simulated data sets of different sizes with a1 = 0.75, τ1 = 10.0 s, and τ2 = 100.0 s. As sample size increases, the ML estimates get very close to the input parameters with narrower confidence intervals (Table 3). However, increasing the number of bins with these large data sets does result in overestimated values of τ2 in least squares fits (Table 3). This can be attributed to the fact that the least squares method is very sensitive to outliers and assumes that the variables are independent and that the errors are normally distributed. In cases where error terms are not normal, the confidence intervals of the least squares estimates are not reliable [24–26]. In our simulation, maximum likelihood outperforms the least squares method for typical “single molecule”-sized data sets of 100–1000 data points.
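Data sets of this kind can be simulated by first choosing a population for each event with probability a1 and then drawing an exponentially distributed dwell time with the corresponding time constant. The sketch below is one possible way to do this and is not necessarily how the simulations reported here were generated; the function name and seed are hypothetical.

```python
import random

def simulate_biexponential(n, a1, tau1, tau2, seed=0):
    """Draw n dwell times from a two-population exponential mixture."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        # Pick the short-lived population with probability a1
        tau = tau1 if rng.random() < a1 else tau2
        times.append(rng.expovariate(1.0 / tau))
    return times

# Input parameters quoted in the text: a1 = 0.75, tau1 = 10 s, tau2 = 100 s
simulated = simulate_biexponential(10_000, 0.75, 10.0, 100.0)
sample_mean = sum(simulated) / len(simulated)

# Expected mean of the mixture: a1 * tau1 + (1 - a1) * tau2
expected_mean = 0.75 * 10.0 + 0.25 * 100.0
```

Repeating the simulation with different values of `n` and different seeds gives a direct feel for how much the fitted parameters scatter at a given sample size.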
Table 3.
| Data points | Number of bins | Bin size | Parameter (input value) | Maximum likelihood results* | Nonlinear least squares results* | R2/Adj R2 |
|---|---|---|---|---|---|---|
| 100000** | 1000 | Equal (1 s/bin) | a1 = 0.75 | 0.74 (0.74–0.74) | 0.72 (0.72–0.72) | 0.9997/0.9997 |
| | | | τ1 = 10 s | 10.9 (10.9–10.9) | 10.2 (10.1–10.2) | |
| | | | τ2 = 100 s | 101.8 (100.3–103.1) | 123.4 (120.6–126.1) | |
| 10000** | 1000 | Equal (1 s/bin) | a1 = 0.75 | 0.75 (0.73–0.76) | 0.73 (0.69–0.72) | 0.9975/0.9975 |
| | | | τ1 = 10 s | 10.9 (10.5–11.3) | 10.2 (10.1–10.2) | |
| | | | τ2 = 100 s | 102.7 (107.5–97.1) | 122.0 (114.7–130.1) | |
| 10000** | 15 | Variable | a1 = 0.75 | 0.75 (0.73–0.76) | 0.76 (0.74–0.79) | 0.9999/0.9999 |
| | | | τ1 = 10 s | 10.9 (10.5–11.3) | 10.8 (10.5–11.1) | |
| | | | τ2 = 100 s | 102.7 (107.5–97.1) | 116.0 (80.5–151.5) | |
| 1000 | 100 | Variable | a1 = 0.75 | 0.76 (0.72–0.80) | 0.73 (0.67–0.81) | 0.9951/0.9950 |
| | | | τ1 = 10 s | 9.5 (8.3–10.7) | 10.4 (10.1–12.8) | |
| | | | τ2 = 100 s | 102.7 (85.7–119.7) | 124.8 (88.3–161.4) | |
| 1000 | 10 | Variable | a1 = 0.75 | 0.76 (0.72–0.80) | 0.76 (0.72–0.80) | 0.9999/0.9998 |
| | | | τ1 = 10 s | 9.5 (8.3–10.7) | 11.08 (10.6–11.6) | |
| | | | τ2 = 100 s | 102.7 (85.7–119.7) | 114.4 (60.9–167.9) | |
| 100 | 10 | Variable | a1 = 0.75 | 0.79 (0.59–0.99) | 0.68 (0.72–1.00) | 0.9988/0.9982 |
| | | | τ1 = 10 s | 8.9 (3.3–14.5) | 6.9 (9.8–14.1) | |
| | | | τ2 = 100 s | 90.7 (17.2–193.2) | 50.0 (−3.9 to 103.9) | |

\* Intervals for each fitting method are shown in parentheses.
\*\* Maximum likelihood fitting results obtained using MEMLET software [34]. MEMLET is more efficient at processing large data sets (≥10000 data points) than AGATHA software.
4. Use of AGATHA Software for ML Fitting
Here, we introduce “AGATHA” (A GATHering of Analyses), a MATLAB-based software package that provides tools for the analysis of the dwell times obtained from CoSMoS experiments (https://github.com/hoskinslab/AGATHA). AGATHA includes a number of subprograms including those for ML analysis (Plotting Histogram), identifying patterns of signal appearance (Sequential Arrival, Simultaneous Arrival, and Short Counter), photobleaching analysis (Counting Photobleaching Steps), and data visualization (Two Color Plot). These programs are accessed via the AGATHA GUI (Figure 4). The Sequential Arrival and Simultaneous Arrival programs are useful for deducing pathways of signal appearance and disappearance in three-color CoSMoS experiments (i.e., determining pathways of biomolecular assembly or disassembly [15]). These programs classify binding events into various categories depending upon times of signal appearance or disappearance. The Counting Photobleaching Steps program counts the number of bleaching steps present in a fluorescence intensity trace by fitting the data to a step function. This is useful for counting the number of fluorophores (biomolecules) present in a molecular assembly. Instruction manuals for each of these programs are found in their respective GUIs. Here, we restrict ourselves to the Plotting Histogram program as the others are beyond the scope of this article. We also note that Woody et al. have independently developed a similar program, MEMLET (MATLAB Enabled Maximum Likelihood Estimate Tools), that utilizes the ML approach to fit data by providing a variety of general or user-defined PDFs [34].
4.1. Plotting Histograms
The Plotting Histogram program (PH) facilitates plotting of dwell time data using various methods for bin size selection as well as ML fitting of the unbinned data. PH calculates the appropriate number of bins from the chosen method (described below) and can also remove empty bins by combining neighboring bins. Along with the histogram, it displays the error in the counting statistics of each bin center by calculating the binomial distribution variance, σ², as σ² = nP(1 − P), where n is the total number of data points and P is the probability of the binding event [39]. Finally, it returns the fit parameters and associated standard deviations by using ML and bootstrap analysis. AGATHA simplifies ML data analysis by requiring the user to supply the relevant inputs to entry widgets in the PH GUI (Figure 5, numbers 1–7). Fitting results are also displayed in widgets once the program has been run (Figure 5, numbers 8 and 9). Below we describe data entry and use of each of the widgets in the PH GUI.
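The binomial error on each bin can be sketched as follows, taking P for a given bin to be the fraction of all events that fall in that bin. This interpretation, the function name, and the bin counts are assumptions made for illustration and are not taken from the AGATHA source code.

```python
import math

def bin_count_errors(counts):
    """Binomial standard deviation sqrt(n * P * (1 - P)) for each bin count."""
    n = sum(counts)
    return [math.sqrt(n * (c / n) * (1 - c / n)) for c in counts]

# Hypothetical bin counts for 288 events and the corresponding bin widths (s)
counts = [150, 80, 40, 18]
widths = [5.0, 10.0, 25.0, 60.0]

errors = bin_count_errors(counts)
n_total = sum(counts)

# Rescale the counting errors to probability density units, N_bin / (w * N_tot)
density_errors = [e / (w * n_total) for e, w in zip(errors, widths)]
```

Sparsely populated bins carry large relative errors, which is one reason that merging empty or nearly empty neighboring bins improves the appearance of the histogram tail.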
4.2. Instructions for Using the Plotting Histogram Program
Mode: In this widget, the user either instructs the software to automatically calculate the number of bins plotted in a histogram (Automatic) or the user can manually input the bin edges in increasing order (Manual).
- Histogram: When Automatic is selected in widget 1, the user then selects one or more of the listed methods for calculating the number of bins in the histogram.
- Sturges: According to the Sturges rule, the number of bins for a histogram is estimated from the number of data points. This calculates the number of bins, m, as m = 1 + log2(n), where n is the total number of data points [28, 40]. It will perform poorly if the number of data points is less than 30 and the points are not normally distributed [41]. As dwell times often follow an exponential distribution (similar to Figure 3A), this method may fail to show an appropriate trend in the data.
- Freedman-Diaconis: This method is less sensitive to outliers in a given data set and might be more suitable for data with heavy-tailed distributions [42]. It uses a bin width, h, of h = 2 × IQR(X)/n^(1/3), where X is the dwell time data, n is the number of data points, and IQR is the interquartile range of X.
- Scott: This method works better if the data are mostly normally distributed. However, this rule is appropriate for other distributions as well. It calculates the bin width, h, as h = 3.49 × σX/n^(1/3), where σX is the standard deviation of the data set X, and n is the number of data points [43].
- Middle: This method makes use of all three methods mentioned above and then chooses the middle (median) value for the number of bins.
- Optimal: An optimization principle is used to minimize the expected least squares loss function between the histogram and an unknown underlying density function [43]. The optimal bin width, h*, is obtained as a minimizer of the formula (2M − V)/h², where M and V are the mean and variance of the number of data points across bins of width h. The optimal number of bins, m, is calculated as m = (max(X) − min(X))/h*, where max(X) and min(X) are the maximum and minimum values of the given data set X. In our experience, this method is frequently used for plotting dwell time distributions obtained from CoSMoS experiments.
- All: This selects all of the above methods and runs them independently.
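The Sturges, Freedman-Diaconis, Scott, and Middle rules listed above can be sketched with the Python standard library. This is an illustrative sketch and not the AGATHA code: the function names are hypothetical, the textbook forms of the rules are assumed (bin widths of 2 × IQR/n^(1/3) and 3.49 × σ/n^(1/3)), fractional bin numbers are rounded up, and the quartile convention of `statistics.quantiles` may differ slightly from MATLAB's.

```python
import math
import statistics

def sturges_bins(x):
    """Sturges rule: m = 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(len(x)))

def freedman_diaconis_width(x):
    """Bin width from the interquartile range: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return 2 * (q3 - q1) / len(x) ** (1 / 3)

def scott_width(x):
    """Bin width from the standard deviation: 3.49 * sigma / n^(1/3)."""
    return 3.49 * statistics.stdev(x) / len(x) ** (1 / 3)

def bins_from_width(x, h):
    """Convert a bin width into a number of bins spanning the data range."""
    return math.ceil((max(x) - min(x)) / h)

def candidate_bins(x):
    return [sturges_bins(x),
            bins_from_width(x, freedman_diaconis_width(x)),
            bins_from_width(x, scott_width(x))]

def middle_bins(x):
    """Median of the three candidate bin numbers."""
    return statistics.median(candidate_bins(x))

# Hypothetical dwell times: evenly spaced quantiles of an exponential, tau = 8.6 s
n = 288
dwell_times = [-8.6 * math.log(1 - (i + 0.5) / n) for i in range(n)]

candidates = candidate_bins(dwell_times)
chosen = middle_bins(dwell_times)
```

For exponentially distributed data the three rules can disagree by a factor of two or more in the number of bins, which is why comparing several of them (the All option) is a sensible first step.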
Events: In this widget, the user specifies whether or not the dwell time data is reported in units of time or camera frames.
Time Units and Intervals: The time units (seconds or milliseconds) are selected within this widget as well as the interval type from the drop-down menu. AGATHA uses input interval files generated by the GLIMPSE and IMSCROLL programs (available at https://github.com/gelles-brandeis/CoSMoS_Analysis) [25]. In these programs the dwell times are classified as different types of intervals, each assigned an integer value between −3 and +3. Details about event classification have been previously described [25] and depend on whether or not the binding event has been observed in its entirety as well as whether binding events or times between binding events are being analyzed.
Function: PH is equipped with single, double and triple exponential probability distributions for fitting the measured data. These functions are labelled as Expfallone_mxl, Expfalltwo_mxl, and Expfallthree_mxl, respectively. PH currently includes equations for processing up to third order PDFs but can be expanded to higher order distributions if needed.
Input PH Parameters: The user should enter the experimentally constrained times Tx (length of the experiment) and Tm (minimum time that can be resolved by the experiment) along with a number for Nboot (the number of data sets to be simulated for bootstrap analysis, which is the same as the number of iterations of bootstrap analysis). For example, Nboot = 1000 was used for Figure 2C. For single exponential distributions, the user should enter an initial estimate for Tau [τ in Equation (5)]. For bi-exponential PDFs, the user gives initial guesses for Tau1, Tau2, and ap. The input value ap is converted to a1 = 1/(1 + ap²) before maximizing the log likelihood in order to constrain a1 between 0 and 1. Similarly, for a tri-exponential distribution, the fit parameters are extended to Tau1, Tau2, Tau3, ap1 and ap2, and a1, a2, and a3 are deduced using the equations a1 = 1/(1 + ap1²), a2 = (1 − a1)/(1 + ap2²) and a1 + a2 + a3 = 1. If the initial guesses are far off, the program may crash or fail to find a solution, in which case new values can be chosen and the analysis rerun.
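The amplitude transformation can be sketched as follows. The optimizer is free to vary ap over all real numbers, while the derived amplitudes always stay between 0 and 1 and sum to one. The function names are hypothetical, and `ap_from_amplitude` is an illustrative helper (not part of AGATHA, as far as we describe it here) for turning a desired starting amplitude into an initial guess for ap.

```python
import math

def amplitude_from_ap(ap):
    """Map an unconstrained parameter ap to an amplitude a1 in (0, 1]."""
    return 1.0 / (1.0 + ap ** 2)

def ap_from_amplitude(a1):
    """Inverse mapping, useful for choosing an initial guess for ap."""
    return math.sqrt(1.0 / a1 - 1.0)

def tri_amplitudes(ap1, ap2):
    """Amplitudes of a tri-exponential PDF from two unconstrained parameters."""
    a1 = 1.0 / (1.0 + ap1 ** 2)
    a2 = (1.0 - a1) / (1.0 + ap2 ** 2)
    a3 = 1.0 - a1 - a2
    return a1, a2, a3

# Initial guess for ap that corresponds to a starting amplitude of 0.74
ap_guess = ap_from_amplitude(0.74)
a1 = amplitude_from_ap(ap_guess)

# Any real-valued pair (ap1, ap2) yields valid amplitudes
a_tri = tri_amplitudes(-2.5, 0.7)
```

Because only ap² enters the transformation, ap and −ap give the same amplitude, so the sign of the fitted ap carries no meaning.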
Update: Clicking the update button will ask the user to select the intervals file to be analyzed and to create an output folder for the results.
Output Fitting: The ML estimates for the fit parameters are returned here.
Output Bootstrap data: The mean and standard deviation of the fit parameter values are displayed after bootstrap analysis. The histograms before and after the fitting will be saved in the same directory with the same name as the input interval file. The program also saves the bootstrap results for all the fitting parameters.
5. Conclusion
AGATHA and MEMLET facilitate ML fitting of complex single molecule data with user-friendly capabilities and options that complement standard software programs. MATLAB’s DF tool application only provides a single exponential function for fitting and cannot fit probability density distributions for multiple exponential or user-defined PDFs. Both AGATHA and MEMLET are capable of fitting data with multi-exponential PDFs and provide estimates and errors for fitting parameters using ML and bootstrapping techniques. Additionally, MEMLET directly provides likelihood ratio model testing, allows the user to input any PDF, and can take text or MATLAB variable files as input. On the other hand, AGATHA is supplemented with various tools for histogram binning and error calculation. Current versions of AGATHA require input in IMSCROLL format [21]; however, these types of files can be easily constructed from any data set.
In conclusion, ML fitting of unbinned dwell or binding time data is often preferable to least squares fitting of binned data sets, because the results of the latter can be skewed by how the histogram is constructed. Implementation of ML methods in MATLAB can be laborious. Fortunately, this is greatly simplified by the AGATHA software.
Highlights
- Single-molecule methods can measure discrete binding events between individual biomolecules
- Maximum likelihood fitting of unbinned binding data can be used to determine kinetic parameters
- AGATHA software automates many time-consuming steps in data fitting and histogram analysis
Acknowledgements
We thank Joshua Larson, Margaret Rodgers, and Clarisse van der Feltz for feedback on the manuscript and Laura Vanderploeg for assistance with figure artwork. We also thank Larry Friedman for helpful discussions and writing the initial MATLAB scripts for ML fitting as part of CoSMoS data analysis. This work was supported by the National Institutes of Health (R01 GM112735 to AAH; R01 GM099752 to AS), Shaw Scientist and Beckman Young Investigator Awards (to AAH), the National Science Foundation (CHE-1710182 to AS), and the Computation and Informatics in Biology and Medicine Training Program (National Library of Medicine training grant 5T15LM007359 to SGFC).
REFERENCES
- [1] Nilsen TW, The spliceosome: the most complex macromolecular machine in the cell?, Bioessays 25(12) (2003) 1147–1149.
- [2] Wahl MC, Will CL, Lührmann R, The spliceosome: design principles of a dynamic RNP machine, Cell 136(4) (2009) 701–718.
- [3] Chen W, Moore MJ, The spliceosome: disorder and dynamics defined, Current Opinion in Structural Biology 24 (2014) 141–149.
- [4] Fica SM, Nagai K, Cryo-electron microscopy snapshots of the spliceosome: structural insights into a dynamic ribonucleoprotein machine, Nature Structural & Molecular Biology 24(10) (2017) 791.
- [5] Shi Y, Mechanistic insights into precursor messenger RNA splicing by the spliceosome, Nature Reviews Molecular Cell Biology 18(11) (2017) 655.
- [6] Berget SM, Moore C, Sharp PA, Spliced segments at the 5′ terminus of adenovirus 2 late mRNA, Proceedings of the National Academy of Sciences 74(8) (1977) 3171–3175.
- [7] Chow LT, Gelinas RE, Broker TR, Roberts RJ, An amazing sequence arrangement at the 5′ ends of adenovirus 2 messenger RNA, Cell 12(1) (1977) 1–8.
- [8] Rauhut R, Fabrizio P, Dybkov O, Hartmuth K, Pena V, Chari A, Kumar V, Lee C-T, Urlaub H, Kastner B, Molecular architecture of the Saccharomyces cerevisiae activated spliceosome, Science (2016) aag1906.
- [9] Krishnan R, Blanco MR, Kahlscheuer ML, Abelson J, Guthrie C, Walter NG, Biased Brownian ratcheting leads to pre-mRNA remodeling and capture prior to first-step splicing, Nature Structural & Molecular Biology 20(12) (2013) 1450.
- [10] Yan C, Wan R, Bai R, Huang G, Shi Y, Structure of a yeast activated spliceosome at 3.5 Å resolution, Science 353(6302) (2016) 904–911.
- [11] Crawford DJ, Hoskins AA, Friedman LJ, Gelles J, Moore MJ, Single-molecule colocalization FRET evidence that spliceosome activation precedes stable approach of 5′ splice site and branch site, Proceedings of the National Academy of Sciences (2013) 201219305.
- [12] Ticau S, Friedman LJ, Ivica NA, Gelles J, Bell SP, Single-molecule studies of origin licensing reveal mechanisms ensuring bidirectional helicase loading, Cell 161(3) (2015) 513–525.
- [13] Friedman LJ, Gelles J, Mechanism of transcription initiation at an activator-dependent promoter defined by single-molecule observation, Cell 148(4) (2012) 679–689.
- [14] Smith BA, Padrick SB, Doolittle LK, Daugherty-Clarke K, Corrêa IR Jr, Xu M-Q, Goode BL, Rosen MK, Gelles J, Three-color single molecule imaging shows WASP detachment from Arp2/3 complex triggers actin filament branch formation, eLife 2 (2013).
- [15] Larson JD, Rodgers ML, Hoskins AA, Visualizing cellular machines with colocalization single molecule microscopy, Chemical Society Reviews 43(4) (2014) 1189–1200.
- [16] Uemura S, Aitken CE, Korlach J, Flusberg BA, Turner SW, Puglisi JD, Real-time tRNA transit on single translating ribosomes at codon resolution, Nature 464(7291) (2010) 1012.
- [17] Aitken CE, Puglisi JD, Following the intersubunit conformation of the ribosome during translation in real time, Nature Structural & Molecular Biology 17(7) (2010) 793.
- [18] Zhao G, Gleave ES, Lamers MH, Single-molecule studies contrast ordered DNA replication with stochastic translesion synthesis, eLife 6 (2017) e32177.
- [19] Hoskins AA, Friedman LJ, Gallagher SS, Crawford DJ, Anderson EG, Wombacher R, Ramirez N, Cornish VW, Gelles J, Moore MJ, Ordered and dynamic assembly of single spliceosomes, Science 331(6022) (2011) 1289–1295.
- [20] Shcherbakova I, Hoskins AA, Friedman LJ, Serebrov V, Corrêa IR Jr, Xu M-Q, Gelles J, Moore MJ, Alternative spliceosome assembly pathways revealed by single-molecule fluorescence microscopy, Cell Reports 5(1) (2013) 151–165.
- [21] Larson JD, Hoskins AA, Dynamics and consequences of spliceosome E complex formation, eLife 6 (2017) e27592.
- [22] Larson J, Kirk M, Drier EA, O’Brien W, MacKay JF, Friedman LJ, Hoskins AA, Design and construction of a multiwavelength, micromirror total internal reflectance fluorescence microscope, Nature Protocols 9(10) (2014) 2317.
- [23] Anderson EG, Hoskins AA, Single molecule approaches for studying spliceosome assembly and catalysis, Spliceosomal Pre-mRNA Splicing, Springer; 2014, pp. 217–241.
- [24] Hansen S, Rodgers M, Hoskins A, Fluorescent Labeling of Proteins in Whole Cell Extracts for Single-Molecule Imaging, Methods in Enzymology, Elsevier; 2016, pp. 83–104.
- [25] Friedman LJ, Gelles J, Multi-wavelength single-molecule fluorescence analysis of transcription mechanisms, Methods 86 (2015) 27–36.
- [26] Friedman LJ, Chung J, Gelles J, Viewing dynamic assembly of molecular complexes by multi-wavelength single-molecule fluorescence, Biophysical Journal 91(3) (2006) 1023–1031.
- [27] Berglund JA, Chua K, Abovich N, Reed R, Rosbash M, The splicing factor BBP interacts specifically with the pre-mRNA branchpoint sequence UACUAAC, Cell 89(5) (1997) 781–787.
- [28] Colquhoun D, Hawkes AG, The Principles of Stochastic Interpretation of Ion Channel Mechanisms, in: Sakmann B, Neher E (Eds.), Single Channel Recording, Plenum Press, New York, 1995, pp. 397–482.
- [29] Pearson K, On the systematic fitting of curves to observations and measurements, Biometrika 1(3) (1902) 265–303.
- [30] Myung IJ, Tutorial on maximum likelihood estimation, Journal of Mathematical Psychology 47 (2003) 90–100.
- [31] Genschel U, Meeker WQ, A comparison of maximum likelihood and median-rank regression for Weibull estimation, Quality Engineering 22(4) (2010) 236–255.
- [32] Gaeuman D, Holt CR, Bunte K, Maximum likelihood parameter estimation for fitting bedload rating curves, Water Resources Research 51(1) (2015) 281–301.
- [33] Fisher RA, On the mathematical foundations of theoretical statistics, Phil. Trans. R. Soc. Lond. A 222(594–604) (1922) 309–368.
- [34] Woody MS, Lewis JH, Greenberg MJ, Goldman YE, Ostap EM, MEMLET: An easy-to-use tool for data fitting and model comparison using maximum-likelihood estimation, Biophysical Journal 111(2) (2016) 273–282.
- [35] Colquhoun D, Sigworth FJ, Fitting and Statistical Analysis of Single-Channel Records, in: Sakmann B, Neher E (Eds.), Single Channel Recording, Plenum Press, New York, 1995, pp. 483–588.
- [36] Efron B, The jackknife, the bootstrap, and other resampling plans, SIAM; 1982.
- [37] Wilks SS, The large-sample distribution of the likelihood ratio for testing composite hypotheses, The Annals of Mathematical Statistics 9(1) (1938) 60–62.
- [38] Akaike H, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19(6) (1974) 716–723.
- [39] Young HD, Statistical Treatment of Experimental Data, McGraw Hill Book Company, Inc., New York, NY (1962).
- [40] Sturges HA, The choice of a class interval, Journal of the American Statistical Association 21(153) (1926) 65–66.
- [41] Hyndman RJ, The problem with Sturges’ rule for constructing histograms, (1995).
- [42] Freedman D, Diaconis P, On the histogram as a density estimator: L2 theory, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57(4) (1981) 453–476.
- [43] Scott DW, On optimal and data-based histograms, Biometrika 66(3) (1979) 605–610.