Author manuscript; available in PMC: 2018 Jan 6.
Published in final edited form as: Nanotechnology. 2016 Nov 29;28(1):015502. doi: 10.1088/0957-4484/28/1/015502

Classification of DNA nucleotides with transverse tunneling currents

Jonas Nyvold Pedersen 1,2, Paul Boynton 3, Massimiliano Di Ventra 3, Antti-Pekka Jauho 1,2, Henrik Flyvbjerg 1
PMCID: PMC5227067  NIHMSID: NIHMS835358  PMID: 27897144

Abstract

It has been theoretically suggested and experimentally demonstrated that fast and low-cost sequencing of DNA, RNA, and peptide molecules might be achieved by passing such molecules between electrodes embedded in a nanochannel. The experimental realization of this scheme faces major challenges, however. In realistic liquid environments, typical currents in tunnelling devices are of the order of picoamps. This corresponds to only six electrons per microsecond, and this number affects the integration time required to do current measurements in real experiments. This limits the speed of sequencing, though current fluctuations due to Brownian motion of the molecule average out during the required integration time. Moreover, data acquisition equipment introduces noise, and electronic filters create correlations in time-series data. We discuss how these effects must be included in the analysis of, e.g., the assignment of specific nucleobases to current signals. As the signals from different molecules overlap, unambiguous classification is impossible with a single measurement. We argue that the assignment of molecules to a signal is a standard pattern classification problem and calculation of the error rates is straightforward. The ideas presented here can be extended to other sequencing approaches of current interest.

Keywords: DNA, sequencing, electron tunneling, pattern classification, molecular signature, biosensing

1. Introduction

Identification and sequencing of single DNA, RNA, and peptide molecules is a key step in many diagnostic protocols. Electronic sequencing of nucleobases and nucleic acids with nanopores or nanogaps has received growing interest as an alternative to optical methods in the last two decades [1–4]. Nanopore sequencing, as originally conceived, records the ionic current through a nanopore that is partially blocked by a nucleotide and attempts to identify that nucleotide from its degree of blocking. However, due to the thickness of the nanochannels employed and the longitudinal direction of the ionic current probe, single-base resolution is difficult to achieve with this approach [1, 2]. For this reason, a complementary concept (“quantum sequencing” [5]) has been suggested, based on the specific molecular fingerprints in the transverse tunneling current that passes through the nucleotide when the latter passes between two electrodes in a nanochannel [5–7]; see figure 1(a).

Figure 1.


(a) Schematic of DNA passing through a nanopore with embedded electrodes forming a nanogap. (b) Electrons tunnel between the electrodes via the nucleotide in the gap and produce a nucleotide-specific current I versus time t. (c) Here p(I|X) is the probability density for measuring the current value I, given that the nucleotide is X, where X is one of the four bases A, T, G, and C. Current signals from different nucleotides overlap, which prevents unambiguous classification [with a single current measurement].

With a break-junction as the electrode pair, single nucleotides have been identified experimentally by their respective transverse tunnelling currents [8, 9]. In addition, quantum sequencing has been used for identification of methylated DNA bases [10], for detection of post-translational modifications in single peptides [11], and for single-molecule spectroscopy of individual amino acids and peptides [12].

Current signals from single nucleotides have also been measured with a scanning-tunneling microscope (STM) [13]. With a functionalized STM tip, the individual nucleotides in a DNA oligomer have been read [14]. Nucleotides have also been identified with a fixed-gap device [15], and DNA molecules have been detected with nanowire-nanopore field-effect transistor sensors [16]. In all cases, the current signal was noisy and with step-like features, and a statistical analysis was required to get the actual sequence information, to determine the type of nucleotide, or just to detect a translocation event [17, 18].

In addition to these experimental efforts, simulations were found useful for testing alternative realizations of electronic nucleotide identification and nanopore sequencing [6, 7, 19–28]. One such alternative measures changes in the current in a graphene nanoribbon while a DNA string passes through a hole in the ribbon [22, 27].

The experimental relevance of these simulations depends on the magnitude of the currents that can be measured experimentally—specifically, it depends on the integration time (bandwidth) required to obtain a signal that stands out well enough over noise and filtering effects to distinguish between different nucleotides. This is a critical issue for any type of sequencing protocol that employs either transverse tunnelling or longitudinal ionic currents.

The present article discusses the subtleties related to the connection between theoretical ideas and simulations with actual experiments. In section 2, we describe how the transverse current through individual nucleotides is simulated. Then we discuss the magnitude of the average current, the amplitude of current fluctuations, and the correlation time of current fluctuations. The correlation time is, interestingly, even shorter than the average waiting time between electrons tunneling through the nucleotide.

Tunneling currents are typically very small so that long integration times are needed to measure them in actual experiments. The reason is charge quantization: A current of 1 pA amounts to six electrons per microsecond, on average. Consequently, narrowly defined current values can be measured only with integration times much longer than microseconds. This limits the time resolution of current measurements, which can be ameliorated by multiplexing with several pairs of electrodes [29].

As a result, the integration time of data acquisition in a realistic experiment is long enough that current fluctuations due to thermal motion of nucleotides average out in a realistic recorded signal (section 3). Electronic noise, however, broadens the distribution of currents recorded for a given nucleotide, so current distributions for different nucleotides overlap (see figure 1(b,c) and section 4).

Electronic filters in the data acquisition system also affect the distribution of recorded currents and autocorrelate the time series of recorded currents (section 5). We show in section 6 how to assign a nucleotide to a current signal and that the autocorrelations play an important role in the assignment. Finally, in section 7 we compare the error rates of nucleotide assignment for simulated data with and without autocorrelations.

Throughout this article we consider only simulations of the transverse tunneling current through the four nucleotides A, T, G, and C. The analysis presented here is nevertheless also valid for other types of sensors that produce weak, overlapping current signals.

2. Magnitude and correlations of simulated current values

Nanopore experiments take place in a liquid environment at ambient temperature [5]. These conditions make simulations of the current through a single nucleotide both time consuming and computationally expensive [7] as they do not only involve the nucleotide of interest, but also the degrees of freedom of the surrounding molecules of the liquid. In previous work by one of us (MDV), the following protocol was used for simulating the transverse current through a single nucleotide as it passes through a nanopore [7,20]: The molecule is driven by a driving field into the nanopore where the electrodes are placed. Then the driving field is reduced and the transverse field is turned on. The molecule moves due to the electric fields and the thermal motion caused by interactions with the surrounding water molecules. This motion is described by molecular dynamics (MD) simulations with a time resolution of 1 fs. The femtosecond timescale is also the timescale for a typical electron transport time through the trapped molecule. Each picosecond the motion is frozen and a tight-binding Hamiltonian is set up which describes the coupling between the electrodes, the liquid and the DNA molecule. The steady-state current is calculated using a single-particle scattering approach with an applied bias of less than 1 V. Then the molecule is released for another time interval of one picosecond and the procedure is repeated many times (on the order of 4000 to 5000 times).

Figure 2 shows an example of a current trace for the nucleotide A, and histograms of the current values for all four nucleotides are shown in figure 3 as obtained in reference [29]. We here plot the log-current probability distributions p(Ĩ|X) with Ĩ = log10(I/Amp), and where X ∈ {A, T, G, C} denotes the four types of nucleotides. That is, the probability distribution for the current I is p(I|X) = (dĨ/dI) p(Ĩ|X) = p(Ĩ|X)/(I ln 10).

Figure 2.


Current value as function of time for the nucleotide A. Note the large range of values. The current through the nucleotide is calculated each picosecond, but some data points are missing due to lack of convergence in the calculation (see SI for details).

Figure 3.


Histograms of the probability distributions p(Ĩ|X) for the log-current values Ĩ = log10(I/Amp) for the four different nucleotides (same as figure 2 in reference [29]). Dashed vertical lines mark points on the current axis where one distribution replaces another as the one with the highest probability density. The colored arrows show the ranges, DX (X ∈ {A, T, G, C}), of current values in which nucleotide X is indicated by a single measured current value (m = 1, where m is the number of current measurements).

Notice that the current distributions span six orders of magnitude, from $10^{-15}$ Amp to $10^{-9}$ Amp (see figure 3). Table 1 shows the corresponding expected values μX and standard deviations σX for the current probability density distributions of figure 3. In experiments with mechanically controlled break-junctions, the transverse current signal from individual nucleotides was in the range ~1–100 pA [8] and thus comparable to the expected values of the simulated currents, such as those shown in figure 2.

Table 1.

Expected values μX and standard deviations σX of the current for the four nucleotides X ∈ {A, T, G, C} for the current distributions shown in figure 3. The correlation time and weight factors are from fits of the experimental periodograms to the theoretical power spectrum corresponding to the autocovariance stated in equation (1). Error bars on w0,X are less than 5% of the fitted values and thus not stated.

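| X | μX [pA] | σX [pA] | τX [ps] | w0,X |
|---|---------|---------|---------|------|
| A | 48      | 41      | 44 ± 5  | 0.70 |
| T | 0.30    | 0.73    | 80 ± 40 | 0.92 |
| G | 4.0     | 3.2     | 60 ± 20 | 0.85 |
| C | 1.3     | 2.0     | 14 ± 7  | 0.94 |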

In the simulations, the contacts to the nucleotides are modelled as gold electrodes [7,20]. Due to the presence of water, the tunneling barrier is considerably reduced: to about 1 eV from the gold work function of about 4.5 eV. Other electrodes, such as Pt, can be (and are currently) used in experiments without much qualitative change in the distributions. For a detailed discussion of the current calculations, see references [6,7,20].

We next take advantage of the simulation times of up to 1500 ps in the simulations of individual nucleotides in the nanopore. Although it is not possible to reach the experimentally relevant sampling times, which are of the order of micro- or milliseconds (see below), we can extract the relevant time scales without approximate solutions for times longer than picoseconds [30].

Current values calculated at different time points are not independent, and the correlations in the signal are quantified with the autocovariance $R_X^{\mathrm{curr}}(k,\ell) \equiv \langle (I_k^X - \mu_X)(I_\ell^X - \mu_X) \rangle$, where $I_k^X$ is the simulated current at the time point tk = kΔtcurr with Δtcurr = 1 ps. Figure 4 shows the autocovariance for the nucleotide A. The autocovariance is consistent with a process with two time-scales,

$$R_X^{\mathrm{curr}}(k,\ell) = \sigma_X^2 \left( w_{0,X}\,\delta_{k,\ell} + [1 - w_{0,X}]\, e^{-|k-\ell|\,\Delta t_{\mathrm{curr}}/\tau_X} \right), \qquad (1)$$

where $\sigma_X^2$ is the total noise-variance. The first term in equation (1) describes the total contribution from all processes with correlation times much shorter than the time between recordings, Δtcurr = 1 ps, i.e., correlation times too short to be resolved. The second term is exponentially decreasing with a characteristic time-scale τX. Fitted values for the parameters of $R_X^{\mathrm{curr}}(k,\ell)$ are given in table 1 for all four nucleotides. The parameter w0,X is the weight factor for processes with correlation times too short to be resolved. It falls in the range from 0.70 to 0.94. Thus most correlations are too brief to be resolved, probably due to reorientation of the water molecules in the solvent, which happens on a time scale of tens of femtoseconds. The longer-lasting correlations decrease exponentially in time with a characteristic time scale τX in the range 14–80 ps. Correlations in the current on the longer time-scale are most likely caused by the motion of the nucleotide between the electrodes.
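As an illustration, the two-time-scale autocovariance of equation (1) can be evaluated directly from the fitted values in table 1. The following is a minimal sketch; the names `TABLE1`, `DT_CURR_PS`, and `autocovariance` are chosen here for illustration and do not come from the simulations themselves.

```python
import math

# Fitted parameters from table 1: (sigma_X [pA], tau_X [ps], w_0X).
TABLE1 = {
    "A": (41.0, 44.0, 0.70),
    "T": (0.73, 80.0, 0.92),
    "G": (3.2, 60.0, 0.85),
    "C": (2.0, 14.0, 0.94),
}
DT_CURR_PS = 1.0  # time between recorded current values, in ps


def autocovariance(nucleotide, lag):
    """Model autocovariance of equation (1) in pA^2 for lag = k - l (in samples)."""
    sigma, tau, w0 = TABLE1[nucleotide]
    delta = 1.0 if lag == 0 else 0.0  # Kronecker delta: unresolved fast processes
    slow = (1.0 - w0) * math.exp(-abs(lag) * DT_CURR_PS / tau)
    return sigma**2 * (w0 * delta + slow)


for lag in (0, 1, 44, 200):
    print(f"A, lag {lag:3d} ps: {autocovariance('A', lag):8.1f} pA^2")
```

At zero lag the expression reduces to the total variance, and the drop between lag 0 and lag 1 reflects the weight w0,X of the unresolved fast processes.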

Figure 4.


Autocovariance of the current values shown in figure 2. The black curve shows the exponential decrease for time lags τ larger than 1 ps. Its characteristic time is τA = 44 ps. Notice that the black curve is not a fit to the data shown, because these data values are autocorrelated. Instead, the parameter τA of the exponential autocorrelation function was determined by fitting the Fourier transform of the autocorrelation function to the power spectrum of the data shown here (Wiener-Khinchin theorem; see SI for details).

Figure 5 shows a schematic of the time scales in the simulation. These are: the time step in the MD-simulations, ΔtMD = 1 fs, the time interval between consecutive recordings of the current, Δtcurr = 1 ps, and the correlation times in current traces, τX ~ 40–70 ps. Furthermore, for a current of 1 pA, the average waiting time, Δtwait, between electrons is ~ 0.1 µs, which is more than 1000 times longer than the correlation time. Consequently, the measured currents are not affected by the thermal motion of the molecule, and the correlations in the calculated current signal cannot be measured experimentally. We elaborate on this finding in the next section.

Figure 5.


Schematic of the various time scales in the simulations of transverse tunneling currents through nucleotides. The time scales are the time step in MD-simulations ΔtMD, the time interval between consecutive recordings of the current Δtcurr, the correlation times in current traces τ, the average waiting time between electron tunneling Δtwait, and the sampling time in an experiment Δts.

3. Connecting simulated current values with experimental recordings

A current cannot be measured instantaneously; a measurement requires that the number of electrons passing through a surface is recorded over a finite time interval. That is, the current $I_{\mathrm{mes},i}^X$ measured at discrete time points ti = iΔts is the number of electrons N passing through the nucleotide X from time ti − Δts to ti divided by the length of the interval ($I_{\mathrm{mes},i}^X = N/\Delta t_s$). Here, we argue that the uncertainty in the measured current is caused by two effects: the shot noise due to the discreteness of electrons, and the correlation time between current values. For the simulations considered here, we demonstrate that the uncertainty in the measured current is dominated by shot noise.

First, the low current values (fA to nA) set a lower limit on the experimental sampling time Δts. With the assumptions of an ideal detector and no correlations between events of electron tunneling, the latter events satisfy Poisson statistics, so a recording with an expected value of 〈N〉 electrons in the time interval Δts will have a relative uncertainty on the number of measured electrons of $1/\sqrt{\langle N \rangle}$. This uncertainty is due to shot noise.

Suppose we aim for an uncertainty of 3%, which requires 〈N〉 = 1000. A current signal of the order of picoamperes corresponds to an expected value of approximately $10^7$ electrons passing through the nucleotide per second. Thus, a measurement time of approximately $10^{-4}$ s = 0.1 ms is needed to detect 1000 electrons on average. A sampling time Δts = 0.1 ms gives a sampling frequency fs = 1/Δts = 10 kHz. Similarly, detection of currents in the nanoamp-regime requires sampling frequencies of at most MHz. Higher sampling frequencies require larger currents. Thus, it seems of questionable relevance to analyze simulated current spikes with durations down to a few picoseconds and a current signal in the nanoamp-regime. Increased sampling frequency also leads to increased thermal noise, as we discuss in section 4.
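The shot-noise estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes an ideal detector and Poisson statistics, as in the text; the function names are illustrative only. For exactly 1 pA it gives a sampling time somewhat longer than the order-of-magnitude figure of 0.1 ms quoted above, because the text rounds the electron rate up to $10^7$ per second.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge in coulombs


def electrons_per_second(current_amp):
    """Average number of electrons per second carried by a DC current."""
    return current_amp / E_CHARGE


def required_sampling_time(current_amp, rel_uncertainty):
    """Sampling time (s) for which Poisson shot noise reaches the target.

    For Poisson counting the relative uncertainty is 1/sqrt(<N>),
    so the required expected count is <N> = 1/rel_uncertainty**2.
    """
    n_required = 1.0 / rel_uncertainty**2
    return n_required / electrons_per_second(current_amp)


rate_1pA = electrons_per_second(1e-12)
dt_1pA = required_sampling_time(1e-12, 1.0 / math.sqrt(1000.0))  # <N> = 1000

print(f"1 pA: {rate_1pA * 1e-6:.1f} electrons per microsecond")
print(f"1 pA, <N> = 1000: sampling time {dt_1pA * 1e3:.2f} ms, "
      f"sampling frequency {1.0 / dt_1pA / 1e3:.1f} kHz")
```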

Next, we consider the uncertainty in the measured current due to its auto-correlated variation caused by the thermal motion of the nucleotide. Mathematically, the current value $I_{\mathrm{mes},i}^X$ recorded for nucleotide X and associated with the point in time ti = iΔts is

$$I_{\mathrm{mes},i}^X = \frac{1}{\Delta t_s} \int_{t_i - \Delta t_s}^{t_i} dt\, I^X(t), \qquad (2)$$

where $I^X(t)$ is the steady-state current for the configuration of the system at time t (see above). As $I^X(t)$ is fluctuating, the measured current $I_{\mathrm{mes},i}^X$ is a stochastic variable. It can be characterized by its expected value and its standard deviation. The expected value of the measured current is $\langle I_{\mathrm{mes},i}^X \rangle = \langle I^X(t) \rangle = \mu_X$. The standard deviation of the measured current depends on the correlations in the current due to the dynamics of the molecule itself and the motion of the surrounding water molecules. With the autocovariance defined in equation (1), the variance of the measured current is (for details, see SI)

$$\sigma_{\mathrm{mes},X}^2 \equiv \langle (I_{\mathrm{mes},i}^X - \mu_X)^2 \rangle \simeq \sigma_X^2 \left[ \frac{\Delta t_{\mathrm{curr}}}{\Delta t_s}\, w_{0,X} + (1 - w_{0,X}) \frac{2\tau_X}{\Delta t_s} \right] \simeq (1 - w_{0,X}) \frac{2\tau_X}{\Delta t_s}\, \sigma_X^2, \qquad (3)$$

where we used in the last two steps that the sampling time is much longer than the correlation time, Δts ≫ τX ≫ Δtcurr. For a sampling frequency of 10 kHz and a correlation time of, say, 50 ps, the prefactor is $2\tau_X/\Delta t_s \sim 10^{-6}$. So the standard deviation of the measured current is $\sigma_{\mathrm{mes},X} \sim 10^{-3}\sigma_X$, which for the present data is of the order of, or less than, femtoamps. That is, the relative uncertainty of the measured current due to the thermal motion of the nucleotide is $\sigma_{\mathrm{mes},X}/\mu_X \sim 10^{-3}$, which is much lower than the relative uncertainty due to shot noise. Thermal motion of the nucleotide thus does not affect experimental measurements.
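To see how strongly time averaging suppresses the fluctuations, the variance of the time-averaged current can be evaluated for the fitted parameters in table 1. This is a sketch under the assumptions stated in the text (sampling time 0.1 ms, Δtcurr = 1 ps, and variance reduction factor Δtcurr/Δts · w0,X + (1 − w0,X) · 2τX/Δts); the variable and function names are illustrative.

```python
import math

# Table 1: (sigma_X [pA], tau_X [ps], w_0X).
TABLE1 = {
    "A": (41.0, 44.0, 0.70),
    "T": (0.73, 80.0, 0.92),
    "G": (3.2, 60.0, 0.85),
    "C": (2.0, 14.0, 0.94),
}
DT_CURR_S = 1e-12  # time between simulated current values
DT_S = 1e-4        # assumed experimental sampling time (10 kHz)


def variance_factor(w0, tau_s, dt_curr_s=DT_CURR_S, dt_s=DT_S):
    """Ratio sigma_mes^2 / sigma_X^2 for a current averaged over dt_s."""
    return (dt_curr_s / dt_s) * w0 + (1.0 - w0) * 2.0 * tau_s / dt_s


sigma_mes_pa = {}
for name, (sigma_pa, tau_ps, w0) in TABLE1.items():
    factor = variance_factor(w0, tau_ps * 1e-12)
    sigma_mes_pa[name] = sigma_pa * math.sqrt(factor)
    print(f"{name}: variance factor {factor:.2e}, "
          f"sigma_mes = {sigma_mes_pa[name] * 1e3:.2f} fA")
```

For every nucleotide the resulting standard deviation is far below 1 pA, i.e., far below the separation between the expected currents in table 1.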

According to table 1, the minimum distance between the expected current values |μX − μX′| for X ≠ X′ is approximately 1 pA; much larger than σmes,X, or, e.g., the 3% relative uncertainty caused by shot noise for 〈N〉 = 1000. Consequently, an ideal measurement could easily distinguish between the four types of nucleotides as the distributions of the measured currents are nonoverlapping. That is, neither the configurational changes of the nucleotide and the surrounding water molecules nor the shot noise can explain the overlapping distributions seen in experiments. Furthermore, an ideal experiment would only be able to estimate the expected values, μX, of the simulated current probability distributions in figure 3, not the actual shapes of the distributions.

Finally, we notice that even though the molecule in the simulation goes through many different configurations during a given measurement, we do not know how much of its phase space is sampled. The molecule could be trapped in a local minimum and only sample a fraction of all possible minima. Therefore simulations should be performed for different initial configurations, and the dependence on the initial conditions should be investigated.

In section 4 we discuss the role of the thermal noise, and in section 5 how filters change the current distributions for the case where the widths of the distributions are not made negligible by the time-averaging in equation (2).

4. Experimental noise

Noise is unavoidable in real measurements. It causes current distributions to overlap and must be accounted for in order to avoid ambiguous classifications of the signal. Previous work has characterized the noise in the ionic current through a solid-state nanopore in a SiN membrane [31] and through graphene nanopores [32]. Both cases show a 1/f-distribution at low frequencies. Reference [33] characterized the noise in the voltage across a gold-wire break-junction in vacuum at room temperature. Both in the presence and absence of a molecule in the junction, at high frequencies the power spectrum of the voltage is identical to the spectrum of thermal (Johnson-Nyquist) noise.

Thermal noise is inevitable in electronic circuits and is due to the thermal voltage fluctuations in a resistor [34]. It causes a Gaussian distributed white noise with standard deviation

$$\sigma_{\mathrm{th}} = \sqrt{\frac{4 k_B T\, \Delta f}{R}}. \qquad (4)$$

Here, kB is Boltzmann's constant, T is the temperature, Δf is the frequency bandwidth within which the current is measured, and R is the resistance of a load resistor put in series with the molecular junction. Notice in particular how a decreased sampling time increases the thermal noise if the total measurement time tmsr is kept unchanged (Δf = fNyq − 1/tmsr ≃ fNyq = 1/(2Δts)). Equation (4) describes a system in equilibrium, while the noise increases if a DC voltage is applied. For measurements with nanogaps in a liquid environment, the standard deviation of the measured background signal was 10 pA for a load resistance of 10 kΩ and a bandwidth Δf ≃ 1/(2Δts) = 0.5 kHz [8]. Thus the estimate for the standard deviation of the thermal noise before filtering is ~ 30 pA. Electronic lowpass filters reduce this noise amplitude, however (see section 5).
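The estimate of roughly 30 pA can be checked numerically with the Johnson-Nyquist expression for current noise, σth = sqrt(4 kB T Δf / R). The sketch below assumes a temperature of 300 K, which is not stated in the text and is used here only as a stand-in for ambient conditions; the function name is illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def thermal_current_noise(temperature_k, bandwidth_hz, resistance_ohm):
    """Standard deviation (A) of Johnson-Nyquist current noise in a resistor."""
    return math.sqrt(4.0 * K_B * temperature_k * bandwidth_hz / resistance_ohm)


# Load resistance 10 kOhm and bandwidth 0.5 kHz as quoted in the text;
# T = 300 K is an assumption made for this example.
sigma_th = thermal_current_noise(300.0, 500.0, 10e3)
print(f"sigma_th = {sigma_th * 1e12:.1f} pA")

# Halving the sampling time doubles the bandwidth and raises the noise by sqrt(2).
ratio = thermal_current_noise(300.0, 1000.0, 10e3) / sigma_th
print(f"noise ratio for doubled bandwidth: {ratio:.3f}")
```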

Figure 6 illustrates this situation with normal distributions with expected values given by μX in table 1 and with standard deviations σnoise = 5 pA. That is, we assume that the noise is normally distributed and added to the signal from the molecule. The distributions show clear overlaps for X = T, G, and C, as σnoise is larger than the distance between the expected values. Current signals from the base A are well separated from the other values, making this nucleotide easily distinguishable. We use the distributions in figure 6 when we discuss nucleotide assignment and the corresponding error rates in sections 6 and 7, respectively.

Figure 6.


Illustration of the distributions of the measured currents for the four different nucleotides. The expected values are taken from table 1 and the widths are due to an added experimental noise with vanishing expected value and standard deviation σnoise = 5 pA. As σnoise is larger than the expected value of the current for the nucleotides T, G, and C, negative current values occur for these nucleotides.

5. Influence of electronic filters

Electronic lowpass filters are indispensable for measurements of small currents. They reduce the noise in measurements, but they also modify the shape of spikes in the signal. This effect is well-studied for the higher-order Bessel filters often used in patch-clamp techniques [35] and in measurements of the ionic blockade in nanopores [36] (see, e.g., references [35] and [37] for an introduction to random data and filters). Filters also change the distribution of the measured current values, which must be considered when comparing measured and simulated currents (figure 7). Finally, filters introduce autocorrelations in the signal. An autocorrelated time series of current measurements contains less information than an uncorrelated series with the same variance, and thus gives higher error rates for the nucleotide assignment. The latter point is addressed in section 7.

Figure 7.


Effect of filtering. The continuous lines (reproduced from figure 3 for convenience) show probability distributions of simulated currents. The dashed lines show probability distributions of filtered simulated currents (first-order filter with fc = fNyq/4).

Linear filters change an incoming signal by outputting a weighted sum over input values. Described in continuous time,

$$I_{\mathrm{out}}(t) = \int_{-\infty}^{\infty} dt'\, h(t - t')\, I_{\mathrm{in}}(t'), \qquad (5)$$

where Iin and Iout are the currents before and after the filter, respectively, and the weight factor h(t) is the filter’s transfer function. For a causal system h(t) = 0 for t < 0. The Fourier transform of the transfer function is the frequency response function H(f). Since a factor 2 is very nearly 3 dB, the frequency at which $|H(f)|^2 = 1/2$ is denoted by f3dB. It is also called the critical frequency and denoted by fc. In experiments, fc is typically chosen as a fraction of the Nyquist frequency fNyq = 1/(2Δts).

A discrete linear filter relates discrete inputs to outputs as

$$I_{\mathrm{out},i} = \sum_{j=-\infty}^{\infty} h_{i-j}\, I_{\mathrm{in},j}. \qquad (6)$$

As an example, we here consider a simple first-order filter (0 ≤ α ≤ 1),

$$I_{\mathrm{out},i} = \alpha I_{\mathrm{in},i} + (1 - \alpha)\, I_{\mathrm{out},i-1}. \qquad (7)$$

Here the output at a given point in time is the weighted sum of the simultaneous input, $I_{\mathrm{in},i}$, and the output at the previous point in time, $I_{\mathrm{out},i-1}$. Iteration of equation (7) gives the weight factors of the filter: $h_j = \alpha(1-\alpha)^j = \alpha e^{j\ln(1-\alpha)} = \alpha e^{-j\Delta t_s/\tau_c}$ for j ≥ 0 and zero otherwise, i.e., the output is an exponentially weighted superposition of the current and all past inputs. The characteristic time scale is τc = −Δts/ln(1 − α), and the characteristic frequency is fc = 1/(2πτc).

Now consider an uncorrelated input signal with μ the average current and $\sigma_{\mathrm{in}}^2$ the variance of the input signal, i.e., $\langle (I_{\mathrm{in},i} - \mu)(I_{\mathrm{in},j} - \mu) \rangle = \sigma_{\mathrm{in}}^2\, \delta_{i,j}$. With equation (6) and the definition of the exponential filter, the autocovariance of the output current follows,

$$R_{\mathrm{out}}(i,j) \equiv \langle (I_{\mathrm{out},i} - \mu)(I_{\mathrm{out},j} - \mu) \rangle = \sigma_{\mathrm{out}}^2\, e^{-|i-j|\,\Delta t_s/\tau_c}. \qquad (8)$$

Here we have introduced $\sigma_{\mathrm{out}}^2 = \sigma_{\mathrm{in}}^2\, \frac{\alpha}{2-\alpha}$. The first-order filter thus gives an exponentially decreasing correlation function and lowers the value of the total variance. We use this expression for the correlation function in section 7, where we calculate the error rates for nucleotide assignment for correlated data.
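The properties of the first-order filter can be checked numerically. The sketch below applies the recursion Iout,i = α Iin,i + (1 − α) Iout,i−1 to white Gaussian noise and compares the empirical variance ratio and lag-1 correlation with α/(2 − α) and 1 − α = exp(−Δts/τc). The critical frequency fc = fNyq/4, the sample count, the random seed, and all function names are choices made for this example only.

```python
import math
import random
import statistics


def alpha_from_cutoff(fc_over_fnyq):
    """Filter weight alpha for a critical frequency given as a fraction of f_Nyq.

    Uses tau_c = 1/(2 pi f_c) and f_Nyq = 1/(2 dt_s), so tau_c/dt_s = 1/(pi * fraction),
    together with tau_c = -dt_s / ln(1 - alpha).
    """
    tau_over_dt = 1.0 / (math.pi * fc_over_fnyq)
    return 1.0 - math.exp(-1.0 / tau_over_dt)


def first_order_filter(inputs, alpha):
    """Apply I_out[i] = alpha * I_in[i] + (1 - alpha) * I_out[i-1]."""
    outputs = []
    previous = 0.0  # assumed initial output; its influence decays as (1 - alpha)**i
    for value in inputs:
        previous = alpha * value + (1.0 - alpha) * previous
        outputs.append(previous)
    return outputs


def autocovariance(series, lag):
    """Empirical autocovariance of a series at a non-negative integer lag."""
    mean = statistics.fmean(series)
    n = len(series) - lag
    return sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n)) / n


rng = random.Random(1)
alpha = alpha_from_cutoff(0.25)
white = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
filtered = first_order_filter(white, alpha)

var_ratio = autocovariance(filtered, 0) / autocovariance(white, 0)
lag1_ratio = autocovariance(filtered, 1) / autocovariance(filtered, 0)
print(f"alpha = {alpha:.4f}")
print(f"variance ratio: simulated {var_ratio:.4f}, alpha/(2-alpha) = {alpha / (2.0 - alpha):.4f}")
print(f"lag-1 correlation: simulated {lag1_ratio:.4f}, 1 - alpha = {1.0 - alpha:.4f}")
```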

The distribution of the recorded output relative to the input is also changed by filters. Assume it were possible to measure the current values in figure 3 with a sampling time as brief as the time between recordings, i.e., with Δts = Δtcurr. Assume also absence of intrinsic correlations (w0,X = 1) and a simple first-order filter with critical frequency fc = fNyq/4, i.e., with characteristic time scale τc ≃ 1.27Δts. Then the distribution of the sampled current values would follow the distributions shown with dashed lines in figure 7. The filtered distributions are smoother than the original ones, and the standard deviations are reduced [see text below equation (8)]. In the limit of very long characteristic times, τc ≫ Δts, the distributions approach normal distributions by force of the central limit theorem. These effects are important to keep in mind when comparing simulation results with experimental data, as the comparison must take into account the distortion of experimental distributions by filters. This could be relevant, e.g., for simulations of the current through a nanoribbon with nucleotides passing through a hole in it. Simulations show an overlap for different nucleotides [22], but electronic filters will decrease these overlaps.

Finally, the autocovariance of experimental data is often affected both by the physical processes in the measured device and by filters in the data acquisition electronics [31, 32]. If the autocovariance can be determined experimentally, it can serve as input for the covariance matrix used when estimating the error rates.

6. Nucleotide assignment using maximum likelihood and error rates

Classification of output from biosensors (and sequencers) is often ambiguous because output values contain a stochastic element. When probability distributions for output values overlap, one cannot tell from a single measurement which input caused the output. For experimentally measured current signals the assignment is often further complicated by, e.g., a varying background signal. The classification problem can then be addressed by machine learning techniques such as support vector machines (SVMs) [30,38]. For simulated data with a stable background and with the current distributions for the different molecules available, we suggest using the maximum likelihood decision rule for nucleotide assignment, as it is a straightforward and standard procedure [39]. In addition, it is easy to simulate the corresponding error rates without any adjustable parameters. The assignment procedure includes the influence of time averaging, experimental noise, and correlations in the signal. We give here a basic vocabulary for the problem of how to assign a nucleotide to a given current signal; for a detailed introduction to pattern classification, see, e.g., reference [39].

As an example, we use the four different types of nucleotides X ∈ {A, T, G, C} and their four associated distributions of values for the transverse tunnelling current. Let

$$\mathbf{I}_m^X = (I_1^X, I_2^X, \ldots, I_m^X) \qquad (9)$$

denote the time series of m current measurements. All current values $I_n^X$ stem from the same nucleotide, so we drop the superscript X from now on. Notice that it is assumed that the probability distribution of current values is known for each nucleotide. So given a current signal $\mathbf{I}_m = \mathbf{I}$ consisting of m measurements, the task is to give an algorithm for how to assign a specific type of nucleotide to the current signal and to determine the error rate, i.e., the relative frequency with which the assignment is incorrect.

The current signal I is our observation. It stems from one of the four types of nucleotides X ∈ {A, T, C, G}. The variable X denotes the ‘state of nature’. Let P(X) denote the a priori probability for the nucleotide being X. How probable it is to observe the signal I, will depend on the ‘state of nature,’ the value of X. So we introduce the class-conditional probability distributions p(I|X). For our problem, these functions are the probability distributions for values of currents (see figure 3), and they are known a priori from the simulations. If we assume that the priors P(X) are also known, Bayes’ formula states that the relation between the prior and the posterior probabilities, i.e., the probability that the ‘state of nature’ is X given the observation I is

$$P(X|\mathbf{I}) = \frac{p(\mathbf{I}|X)\, P(X)}{\sum_{X'} p(\mathbf{I}|X')\, P(X')}. \qquad (10)$$

Notice the normalization condition ΣX P(X|I) = 1. Here, we also follow the convention in reference [39] and let the probability functions over discrete and continuous sets be denoted by upper-case P and lower-case p, respectively.
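As a concrete illustration of Bayes' formula, the posterior probabilities can be computed for a single measured current (m = 1) using the normal distributions of figure 6, i.e., expected values from table 1 and σnoise = 5 pA. This is a sketch under those assumptions; the function names and the example current values are hypothetical.

```python
import math

MU_PA = {"A": 48.0, "T": 0.30, "G": 4.0, "C": 1.3}  # expected currents, table 1
SIGMA_NOISE_PA = 5.0                                 # noise width used in figure 6


def normal_pdf(x, mu, sigma):
    """Probability density of a normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(current_pa, priors=None):
    """Posterior P(X | I) for one measured current; equal priors by default."""
    if priors is None:
        priors = {x: 1.0 / len(MU_PA) for x in MU_PA}
    joint = {x: normal_pdf(current_pa, MU_PA[x], SIGMA_NOISE_PA) * priors[x]
             for x in MU_PA}
    norm = sum(joint.values())  # denominator of Bayes' formula
    return {x: value / norm for x, value in joint.items()}


for current in (3.0, 48.0):
    probs = posterior(current)
    summary = ", ".join(f"P({x}|I) = {p:.3f}" for x, p in probs.items())
    print(f"I = {current:5.1f} pA: {summary}")
```

A current near 48 pA points almost unambiguously to A, whereas a current of a few picoamps leaves T, G, and C with comparable posterior probabilities.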

We need a decision rule to decide which ‘state of nature’ the system was in when it produced the current signal I. It can be shown that the decision rule which minimises the error is Bayes’ Decision Rule [39], which amounts to choosing the ‘state-of-nature’ X with the highest a posteriori probability P(X|I). If we have no prior information about the molecules, it is reasonable to assume that they all have the same a priori probability P(X) for all X. This gives the maximum likelihood decision rule, which is to choose the X which maximizes the likelihood p(I|X), i.e.,

$$\text{decide } X \text{ if } p(\mathbf{I}|X) > p(\mathbf{I}|X') \text{ for all } X' \neq X. \qquad (11)$$

This is the decision rule we will use below. Notice how the decision rule divides the m-dimensional space for the observable I into different domains DX, where DX is the domain where we choose X, i.e., $D_X = \{\mathbf{I} \mid p(\mathbf{I}|X) > p(\mathbf{I}|X') \text{ for all } X' \neq X\}$. This can also be expressed as an indicator function $\mathbf{1}_{D_X}(\mathbf{I})$ with the properties $\mathbf{1}_{D_X}(\mathbf{I}) = 1$ if $p(\mathbf{I}|X) > p(\mathbf{I}|X')$ for all $X' \neq X$ and 0 otherwise.

The different domains DX are simple to illustrate for the probability distributions in figure 3 for the m = 1 case of a single measurement, see the horizontal arrows in figure 3. The vertical dashed lines mark the intersections between the distributions. For general probability density distributions, the partition of the space of possible current values may be more complicated.
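For distributions simpler than those in figure 3, the domains can be written down explicitly. The sketch below uses the normal distributions of figure 6 (expected values from table 1, common width σnoise = 5 pA) rather than the simulated log-current histograms. With equal widths and equal priors, the maximum likelihood rule reduces to picking the nearest expected value, so the domain boundaries for m = 1 lie midway between neighbouring expected values. The function names are illustrative.

```python
MU_PA = {"A": 48.0, "T": 0.30, "G": 4.0, "C": 1.3}  # expected currents, table 1
SIGMA_NOISE_PA = 5.0                                 # common noise width, figure 6


def log_likelihood(current_pa, nucleotide):
    """Log of the normal density, up to a constant shared by all nucleotides."""
    z = (current_pa - MU_PA[nucleotide]) / SIGMA_NOISE_PA
    return -0.5 * z * z


def ml_assign(current_pa):
    """Maximum likelihood decision for a single measured current."""
    return max(MU_PA, key=lambda x: log_likelihood(current_pa, x))


def domain_boundaries():
    """Boundaries between neighbouring domains D_X for equal-width normals."""
    ordered = sorted(MU_PA, key=MU_PA.get)
    return [(a, b, 0.5 * (MU_PA[a] + MU_PA[b])) for a, b in zip(ordered, ordered[1:])]


for lower, upper, boundary in domain_boundaries():
    print(f"boundary between D_{lower} and D_{upper}: {boundary:.2f} pA")
for current in (-3.0, 1.0, 10.0, 30.0):
    print(f"I = {current:5.1f} pA -> {ml_assign(current)}")
```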

So far we have not specified how to calculate the class-conditional probability density function p(Im|X), but we return to this issue in section 7.

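The easiest way to find the error rate is to calculate the probability $P^X_{\mathrm{correct},m}$ of a correct assignment for the nucleotide X, and then find the error rate as $e^X_m = 1 - P^X_{\mathrm{correct},m}$. The overall probability of being correct is the probability that the ‘state of nature’ is X and I is in $D_X$, summed over X; the conditional factor $P(I \in D_X|X)$ in each term is $P^X_{\mathrm{correct},m}$, i.e., [39]

$$P_{\mathrm{correct},m} = \sum_X P(I \in D_X|X)\,P(X) = \sum_X \int_{D_X} p(I|X)\,P(X)\,dI = \sum_X \int 1_{D_X}(I)\,p(I|X)\,P(X)\,dI. \tag{12}$$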

Here, 1DX is an indicator function that is specified above for the maximum likelihood decision rule, although other possibilities exist [39].
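For a single measurement (m = 1) and Gaussian distributions of equal width, the integrals in equation (12) reduce to differences of the normal cumulative distribution function between neighbouring domain boundaries. The sketch below evaluates them under those assumptions. The means, the width, and the equal priors are hypothetical values chosen for illustration.

```python
import math

# Hypothetical means, common width, and equal priors; assumptions for illustration.
MEANS = {"A": 9.0, "T": 1.0, "C": 2.0, "G": 3.0}
SIGMA = 1.0
PRIORS = {x: 0.25 for x in MEANS}


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def correct_probabilities():
    """P(I in D_X | X) for m = 1; equal widths make D_X a nearest-mean interval."""
    ordered = sorted(MEANS, key=MEANS.get)
    midpoints = [0.5 * (MEANS[a] + MEANS[b]) for a, b in zip(ordered, ordered[1:])]
    edges = [-math.inf] + midpoints + [math.inf]
    result = {}
    for k, x in enumerate(ordered):
        lower = (edges[k] - MEANS[x]) / SIGMA
        upper = (edges[k + 1] - MEANS[x]) / SIGMA
        result[x] = normal_cdf(upper) - normal_cdf(lower)
    return result


p_correct_x = correct_probabilities()
error_rates = {x: 1.0 - p for x, p in p_correct_x.items()}
p_correct = sum(p_correct_x[x] * PRIORS[x] for x in MEANS)  # equation (12)

for x in MEANS:
    print(x, round(error_rates[x], 4))
print("overall P_correct:", round(p_correct, 4))
```

For general distributions the domains are not simple intervals, and Monte Carlo integration or simulation is the more practical route.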

Given a set of probability distributions p(I|X) and a partition DX dividing the range of outcomes for I, error rates can be calculated by direct evaluation of the m-dimensional integral in equation (12), e.g., by Monte Carlo integration [40]. Often it is much easier to Monte Carlo simulate the error rates, which is done separately for each type of nucleotide Xchosen. In case of m measurements, the procedure is:

(i) From the current probability distribution p(Im|Xchosen) draw m independent current values Im, (ii) calculate for all four nucleotides the conditional probability density p(Im|X), (iii) assign to the current sequence Im the nucleotide Xassigned with the highest conditional probability density p(Im|X), and finally (iv) record whether the chosen nucleotide Xchosen is identical to the assigned nucleotide Xassigned. Steps (i)-(iv) are repeated many times.

The error rate $e^X_m$ is simply the relative frequency with which a different nucleotide is assigned to a current sequence produced by the nucleotide $X_{\mathrm{chosen}}$. The weighted average of the error rates is

$$e_m = 1 - P_{\mathrm{correct},m} = \sum_X e^X_m\,P(X), \tag{13}$$

where P(X) is the prior for the nucleotide of type X.
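Steps (i)-(iv) and the weighted average in equation (13) can be sketched as follows for independent, normally distributed measurements. The means, the common width, the equal priors, the number of trials, and all function names are assumptions made for this illustration and are not the simulated distributions of the paper.

```python
import random

# Hypothetical means, common width, and equal priors; assumptions for illustration.
MEANS = {"A": 9.0, "T": 1.0, "C": 2.0, "G": 3.0}
SIGMA = 1.0
PRIORS = {x: 0.25 for x in MEANS}


def log_likelihood(currents, x):
    """Log of the product-form p(I_m|X), dropping terms that are equal for all X."""
    return -sum((c - MEANS[x]) ** 2 for c in currents) / (2.0 * SIGMA ** 2)


def simulate_error_rate(chosen, m, trials, rng):
    """Relative frequency of wrong assignments for the nucleotide X_chosen."""
    errors = 0
    for _ in range(trials):
        # (i) draw m independent current values from p(I|X_chosen)
        currents = [rng.gauss(MEANS[chosen], SIGMA) for _ in range(m)]
        # (ii) evaluate the (log) conditional probability density for all X
        scores = {x: log_likelihood(currents, x) for x in MEANS}
        # (iii) assign the nucleotide with the highest density
        assigned = max(scores, key=scores.get)
        # (iv) record whether the assignment is wrong
        if assigned != chosen:
            errors += 1
    return errors / trials


rng = random.Random(0)
error_rates = {x: simulate_error_rate(x, 4, 20000, rng) for x in MEANS}
average_error = sum(error_rates[x] * PRIORS[x] for x in MEANS)  # equation (13)
print(error_rates, round(average_error, 3))
```

Working with the logarithm of the density leaves the decision unchanged and avoids numerical underflow when m is large.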

In section 7 we demonstrate how to calculate the error rates of the nucleotide assignment for the distributions in figure 6 when the current measurements are correlated by first-order filtering.

7. Error rates for correlated data

Assignment of nucleotides and the corresponding error rates depend on the class-conditional probability density function p(Im|X), i.e., the probability to measure the set of current values Im for given nucleotide X. We argued above that both physical processes and electronic filters introduce correlations in the measured signal. We here demonstrate how the correlations influence the error rates for the nucleotide assignment.

For the sake of simplicity, we assume that the measurement noise is normally distributed as it is, e.g., for thermal noise. Then the probability density function p(Im|X) is given by the multivariate normal distribution

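$$p(\mathbf{I}_m|X) = \frac{1}{\sqrt{\det(2\pi\Sigma_X)}} \exp\left(-\frac{1}{2}\,[\mathbf{I}_m - \boldsymbol{\mu}_X]^T\,\Sigma_X^{-1}\,[\mathbf{I}_m - \boldsymbol{\mu}_X]\right). \tag{14}$$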

Here $\boldsymbol{\mu}_X$ is an m-dimensional vector with identical elements $\mu_X$, and $\Sigma_X$ is the (positive definite) m × m covariance matrix $\Sigma_{X,ij} = R(i, j)$, i, j = 1, 2, …, m, where R(i, j) is the autocovariance. Notice that if the current values are independent and identically distributed, the covariance matrix is a diagonal matrix with the variance of the distribution on the diagonal, $\Sigma_{X,ij} = \sigma_X^2 \delta_{ij}$. Then the expression in equation (14) reduces to the product form $p(\mathbf{I}_m|X) = \prod_{n=1}^{m} p(I_n|X) = \prod_{n=1}^{m} \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp[-(I_n - \mu_X)^2/(2\sigma_X^2)]$.
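To make the reduction explicit, the following worked step inserts the diagonal covariance matrix into equation (14). It uses only the definitions given above.

```latex
\begin{align*}
\Sigma_{X,ij} = \sigma_X^2\,\delta_{ij}
  \;&\Rightarrow\;
  \det(2\pi\Sigma_X) = (2\pi\sigma_X^2)^m ,
  \qquad
  (\Sigma_X^{-1})_{ij} = \sigma_X^{-2}\,\delta_{ij} , \\
[\mathbf{I}_m - \boldsymbol{\mu}_X]^T\,\Sigma_X^{-1}\,[\mathbf{I}_m - \boldsymbol{\mu}_X]
  &= \sum_{n=1}^{m} \frac{(I_n - \mu_X)^2}{\sigma_X^2} , \\
p(\mathbf{I}_m|X)
  &= (2\pi\sigma_X^2)^{-m/2}
     \exp\left[-\sum_{n=1}^{m} \frac{(I_n - \mu_X)^2}{2\sigma_X^2}\right]
   = \prod_{n=1}^{m} \frac{1}{\sqrt{2\pi\sigma_X^2}}
     \exp\left[-\frac{(I_n - \mu_X)^2}{2\sigma_X^2}\right] .
\end{align*}
```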

As an example, we consider the case where the autocovariance is identical for all four nucleotides, and the autocovariance matrix is $\Sigma_{X,ij} = \Sigma_{ij} = \sigma_{\mathrm{noise}}^2\, e^{-|i-j|\Delta t_s/\tau_c}$. This corresponds to the output from a first-order filter with a characteristic time scale $\tau_c$, given a white-noise input. The characteristic time scale is again chosen such that it corresponds to a first-order filter with a critical frequency $f_c = 1/(2\pi\tau_c) = f_{\mathrm{Nyq}}/4$, i.e., $\tau_c \simeq 1.27\,\Delta t_s$. For the current distributions shown in figure 6, we then simulate the assignment of nucleotides as described above with the use of equation (14). Finally, we calculate the error rates for the individual nucleotides, $e^X_m$, and the average error rate, $e_m$, from equation (13)*. The error rates versus the number of measurements are shown as dashed lines in figure 8. Full lines are the results for independent measurements, all with the same total noise variance, i.e., $\Sigma_{ij} = \sigma_{\mathrm{noise}}^2 \delta_{ij}$. Error rates are higher and decay more slowly for correlated than for independent measurements, since correlated data contain less independent information. Error rates for a Gaussian filter with the same critical frequency and the same noise variance are found in the SI. The results are very similar to those for a first-order filter with the same characteristic time scale.
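The correlated case can be sketched in the same way. Rather than building the m × m matrix and evaluating equation (14) directly, the sketch below uses the fact that a stationary Gaussian sequence with covariance $\sigma_{\mathrm{noise}}^2 \rho^{|i-j|}$, $\rho = e^{-\Delta t_s/\tau_c}$, is a first-order autoregressive process, which gives the same quadratic form in closed form. The means, the noise level, the number of trials, and all names are hypothetical choices for illustration, not the values behind figure 8.

```python
import math
import random

# Hypothetical mean currents (arbitrary units); not the values behind figure 6.
MEANS = {"A": 9.0, "T": 1.0, "C": 2.0, "G": 3.0}
SIGMA_NOISE = 1.0  # assumed total noise standard deviation, the same for all X

# f_c = 1/(2 pi tau_c) = f_Nyq/4 with f_Nyq = 1/(2 dt) gives tau_c = 4 dt / pi.
TAU_C_OVER_DT = 4.0 / math.pi         # about 1.27
RHO = math.exp(-1.0 / TAU_C_OVER_DT)  # correlation between neighbouring samples


def draw_correlated(mean, m, rng):
    """Draw m values with covariance SIGMA_NOISE**2 * RHO**|i - j| around mean."""
    noise = [rng.gauss(0.0, SIGMA_NOISE)]
    innovation_sd = SIGMA_NOISE * math.sqrt(1.0 - RHO ** 2)
    for _ in range(m - 1):
        noise.append(RHO * noise[-1] + rng.gauss(0.0, innovation_sd))
    return [mean + value for value in noise]


def log_likelihood(currents, x):
    """Exponent of equation (14) for this covariance; the determinant is equal for all X."""
    dev = [current - MEANS[x] for current in currents]
    quad = dev[0] ** 2
    for previous, current in zip(dev, dev[1:]):
        quad += (current - RHO * previous) ** 2 / (1.0 - RHO ** 2)
    return -quad / (2.0 * SIGMA_NOISE ** 2)


def simulate_error_rate(chosen, m, trials, rng):
    """Relative frequency of wrong maximum likelihood assignments for X_chosen."""
    errors = 0
    for _ in range(trials):
        currents = draw_correlated(MEANS[chosen], m, rng)
        scores = {x: log_likelihood(currents, x) for x in MEANS}
        if max(scores, key=scores.get) != chosen:
            errors += 1
    return errors / trials


rng = random.Random(0)
for m in (1, 4, 16):
    rates = {x: simulate_error_rate(x, m, 5000, rng) for x in MEANS}
    print(m, {x: round(rate, 3) for x, rate in rates.items()})
```

Setting `RHO = 0.0` in this sketch recovers the independent case, so the two sets of error rates can be compared at equal total noise variance.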

Figure 8.

Error rates $e^X_m = 1 - P^X_{\mathrm{correct},m}$ versus the number of measurements m for the distributions T, G, and C in figure 6 (error rates for the nucleotide A are less than 0.01% for all m and thus not shown). Full lines show the error rates for uncorrelated data, while the dashed lines show error rates for data filtered through a first-order filter with a critical frequency $f_c = f_{\mathrm{Nyq}}/4$. Notice that the total noise variance is $\Sigma_{ii} = \sigma_{\mathrm{noise}}^2$ for both the correlated and uncorrelated data. The weighted average, $e_m$, of the error rate over all four nucleotides [equation (13) with P(X) = 0.25 for all X] is shown with magenta lines.

These findings stress the importance of including correlations in the algorithms for nucleotide assignment or step detection in experimental signals. The version of the step-finder algorithm CUSUM used for detection of multi-level events in nanopore translocation experiments [17] assumes a signal consisting of independent data points, although this condition is not fulfilled by the experimental data. The assumption might influence the results of the nucleotide assignment and the corresponding error rates, especially for high noise levels and small separations between the expected current values for the different nucleotides.

The time a nucleotide spends between the electrodes determines the number of measurements done on it. Typically this cannot be easily controlled experimentally, as the detachment of the nucleotides from the electrodes is a stochastic process, and the distribution of durations is often rather broad. For GMP molecules in a break-junction, the duration in the gap was in the interval from 1 to 100 ms and showed a dependence on the applied bias [8]. For a sampling frequency of 1 kHz, this corresponds to up to 100 measurements at the electrodes. The duration the target molecule spends at the electrodes can be increased by functionalizing the junction, which gives durations up to a second [13, 14, 38, 41]. Thus the relevance of theoretical proposals for sequencing or biosensing depends both on the decrease of error rates with the number of measurements and on the four distributions of time spent by molecule X between the electrodes.

8. Discussion

The present study emphasizes that the very weak transverse tunneling currents require experimental current measurements with long integration times, and it describes the consequences of a long integration time for the measured currents. These considerations are relevant not only for sequencing with fixed electrodes but also for simulations of nanopore sequencing of single-stranded DNA with graphene nanoribbons [27] and for recognition tunneling [30].

One consequence of the long integration time is that only the expected value of the current is probed experimentally, because the required integration time is very much longer than the autocorrelation time of current fluctuations caused by the nucleotide’s thermal motion. Thus, a current measurement averages over so many different orientations of the nucleotide in the gap junction that the resulting current value is a thermal average with no dependence on nucleotide orientation. Consequently, different measurements with such long integration times should give very similar current values, i.e., values with a very narrow distribution on the current axis. Nevertheless, the full distributions of the simulated transverse tunneling currents are needed in order to determine their expected values. This is because the simulated current values for each nucleotide span almost three orders of magnitude due to the thermal fluctuations of the molecule in the nanogap. So it is not sufficient to calculate the tunneling current for only a few fixed configurations of a nucleotide. This can lead to incorrect values for the current’s expected value.

Secondly, in the original simulations of transverse tunneling through nucleotides, the electron transport was described as coherent tunneling [6, 7]. A later simulation included dephasing of the tunneling electrons due to the fluctuations of the molecule and its environment. These effects changed the distribution of the simulated current values [20]. For experimentally relevant values of this dephasing, it caused a slight downward shift in the expected value of the current. It also slightly changed the shape of the current distribution. The shift might be detectable in experiments, but the change of shape is washed out by the long integration time required in real experiments.

We also addressed how to assign a nucleotide to a measured current signal with the maximum likelihood decision rule. The general challenge for the assignment is that the four different nucleotides have overlapping current distributions, broadened by electronic noise in the data acquisition system. Electronic sequencing would be easy without these overlaps: A single measurement of the transverse tunneling current would identify a nucleotide.

With some overlap, we can still distinguish between different nucleotides, albeit with a non-zero error rate. We just need to repeat measurements on the individual nucleotide several times to obtain a reliable result. We must, however, consider that electronic filters in the data acquisition system produce autocorrelations in the filtered signal. So although electronic filters are indispensable for measurements of small currents, their effect on the recorded current signal must be included in the data analysis, since filtering reduces the information content in the signal relative to a signal with the same number of measurements but with independent data points.

The maximum likelihood framework for nucleotide assignment is easily generalized to more complicated setups than just a single pair of electrodes (see, e.g., the setup in [29]), or extended to include other types of information than just the measured current values. Other aspects that could help the identification could be, e.g., the duration of current spikes, the time interval between spikes, and the fluctuations of currents within spikes [30]. This extra information can be exploited in the assignment of a molecule to a recorded signal, if correlations between the measured quantities—e.g., the duration of a spike and its height—are correctly accounted for in the analysis.

Recently, it was investigated theoretically by simulations whether the use of multiple electrode pairs coupled in series could improve identification of nucleotides [29, 42]. The advantage of multiple electrodes is an increased number of measurements for each nucleotide and, consequently, a lower error rate. If the distribution of current values measured with each electrode pair is known, then the assignment procedure described above can be applied directly.

9. Conclusion

We have demonstrated the importance of realistic experimental integration times, of autocorrelation times in simulated current values, and of electronic noise and filters. Simulations must relate to real experimental measurements in order to assess the feasibility of theoretical proposals for real experiments. When the probability distributions of current values are known, which is the case for simulated data, we recommend using the maximum likelihood decision rule for nucleotide assignment, while also accounting for the correlations in the measured signal in order not to underestimate the error rates for the assignment.

Supplementary Material

NANO_28_1_015502_suppdata.pdf

Acknowledgments

The Center for Nanostructured Graphene (CNG) is sponsored by the Danish National Research Foundation, Project DNRF103. The research leading to these results has received funding from the European Union Seventh Framework Programme under grant agreement no. 604391 Graphene Flagship. PB and MD acknowledge partial support from the NIH-National Human Genome Research Institute.

Footnotes

In typical measurements of the ionic current through a nanopore, the current is in the range of hundreds of nA.

§

See SI for how to calculate the autocovariance from data.

The black curve in figure 4 is not obtained from a fit with the expression in equation (1), but from a fit to the corresponding power spectrum (see the SI for details).

A sampling frequency of 10 kHz is ten times the sampling frequency in the break-junction experiments in reference [8].

+

For a discussion of filter design and of how to choose the critical frequency, see, e.g., [35].

*

Multivariate normal distributions are available as built-in functions in, e.g., MATLAB.

References

  • 1. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB, Krstic PS, Lindsay S, Ling XS, Mastrangelo CH, Meller A, Oliver JS, Pershin YV, Ramsey JM, Riehn R, Soni GV, Tabard-Cossa V, Wanunu M, Wiggin M, Schloss JA. Nat. Biotechnol. 2008;26:1146–1153. doi: 10.1038/nbt.1495.
  • 2. Venkatesan BM, Bashir R. Nat. Nanotechnol. 2011;6:615–624. doi: 10.1038/nnano.2011.129.
  • 3. Muthukumar M, Plesa C, Dekker C. Phys. Today. 2015;68:40–46.
  • 4. Heerema SJ, Dekker C. Nat. Nanotechnol. 2016;11:127–136. doi: 10.1038/nnano.2015.307.
  • 5. Di Ventra M, Taniguchi M. Nat. Nanotechnol. 2016;11:117–126. doi: 10.1038/nnano.2015.320.
  • 6. Zwolak M, Di Ventra M. Nano Lett. 2005;5:421–424. doi: 10.1021/nl048289w.
  • 7. Lagerqvist J, Zwolak M, Di Ventra M. Nano Lett. 2006;6:779–782. doi: 10.1021/nl0601076.
  • 8. Tsutsui M, Taniguchi M, Yokota K, Kawai T. Nat. Nanotechnol. 2010;5:286–290. doi: 10.1038/nnano.2010.42.
  • 9. Ohshiro T, Matsubara K, Tsutsui M, Furuhashi M, Taniguchi M, Kawai T. Sci. Rep. 2012;2. doi: 10.1038/srep00501.
  • 10. Tsutsui M, Matsubara K, Ohshiro T, Furuhashi M, Taniguchi M, Kawai T. J. Am. Chem. Soc. 2011;133:9124–9128. doi: 10.1021/ja203839e.
  • 11. Ohshiro T, Tsutsui M, Yokota K, Furuhashi M, Taniguchi M, Kawai T. Nat. Nanotechnol. 2014;9:835–840. doi: 10.1038/nnano.2014.193.
  • 12. Zhao Y, Ashcroft B, Zhang P, Liu H, Sen S, Song W, Im J, Gyarfas B, Manna S, Biswas S, Borges C, Lindsay S. Nat. Nanotechnol. 2014;9:466–473. doi: 10.1038/nnano.2014.54.
  • 13. Chang S, Huang S, He J, Liang F, Zhang P, Li S, Chen X, Sankey O, Lindsay S. Nano Lett. 2010;10:1070–1075. doi: 10.1021/nl1001185.
  • 14. Huang S, He J, Chang S, Zhang P, Liang F, Li S, Tuchband M, Fuhrmann A, Ros R, Lindsay S. Nat. Nanotechnol. 2010;5:868–873. doi: 10.1038/nnano.2010.213.
  • 15. Pang P, Ashcroft BA, Song W, Zhang P, Biswas S, Qing Q, Yang J, Nemanich RJ, Bai J, Smith JT, Reuter K, Balagurusamy VSK, Astier Y, Stolovitzky G, Lindsay S. ACS Nano. 2014;8:11994–12003. doi: 10.1021/nn505356g.
  • 16. Xie P, Xiong Q, Fang Y, Qing Q, Lieber CM. Nat. Nanotechnol. 2012;7:119–125. doi: 10.1038/nnano.2011.217.
  • 17. Raillon C, Granjon P, Graf M, Steinbock LJ, Radenovic A. Nanoscale. 2012;4(16):4916–4924. doi: 10.1039/c2nr30951c.
  • 18. Plesa C, Dekker C. Nanotechnology. 2015;26:084003. doi: 10.1088/0957-4484/26/8/084003.
  • 19. Zwolak M, Di Ventra M. Rev. Mod. Phys. 2008;80:141–165.
  • 20. Krems M, Zwolak M, Pershin YV, Di Ventra M. Biophys. J. 2009;97:1990–1996. doi: 10.1016/j.bpj.2009.06.055.
  • 21. Nelson T, Zhang B, Prezhdo OV. Nano Lett. 2010;10:3237–3242. doi: 10.1021/nl9035934.
  • 22. Saha KK, Drndić M, Nikolić BK. Nano Lett. 2012;12:50–55. doi: 10.1021/nl202870y.
  • 23. Ahmed T, Kilina S, Das T, Haraldsen JT, Rehr JJ, Balatsky AV. Nano Lett. 2012;12:927–931. doi: 10.1021/nl2039315.
  • 24. Ahmed T, Haraldsen JT, Zhu JX, Balatsky AV. J. Phys. Chem. Lett. 2014;5:2601–2607. doi: 10.1021/jz501085e.
  • 25. Farimani AB, Min K, Aluru NR. ACS Nano. 2014;8:7914–7922. doi: 10.1021/nn5029295.
  • 26. Kim HS, Kim YH. Biosens. Bioelectron. 2015;69:186–198. doi: 10.1016/j.bios.2015.02.020.
  • 27. Qiu H, Sarathy A, Leburton JP, Schulten K. Nano Lett. 2015;15:8322–8330. doi: 10.1021/acs.nanolett.5b03963.
  • 28. Qiu H, Girdhar A, Schulten K, Leburton JP. ACS Nano. 2016;10:4482–4488. doi: 10.1021/acsnano.6b00226.
  • 29. Boynton P, Balatsky AV, Schuller IK, Di Ventra M. J. Comput. Electron. 2014;13:1–7.
  • 30. Krstić P, Ashcroft B, Lindsay S. Nanotechnology. 2015;26:084001. doi: 10.1088/0957-4484/26/8/084001.
  • 31. Smeets RMM, Keyser UF, Dekker NH, Dekker C. Proc. Natl. Acad. Sci. U.S.A. 2008;105:417–421. doi: 10.1073/pnas.0705349105.
  • 32. Heerema SJ, Schneider GF, Rozemuller M, Vicarelli L, Zandbergen HW, Dekker C. Nanotechnology. 2015;26:074001. doi: 10.1088/0957-4484/26/7/074001.
  • 33. Sydoruk VA, Xiang D, Vitusevich SA, Petrychuk MV, Vladyka A, Zhang Y, Offenhäusser A, Kochelap VA, Belyaev AE, Mayer D. J. Appl. Phys. 2012;112:014908.
  • 34. Kittel C, Kroemer H. Thermal Physics. San Francisco: W.H. Freeman; 1980.
  • 35. Colquhoun D, Sigworth FJ. Fitting and statistical analysis of single-channel records. In: Sakmann B, Neher E, editors. Single-Channel Recording. Springer US; 1995. pp. 483–587.
  • 36. Pedone D, Firnkes M, Rant U. Anal. Chem. 2009;81:9689–9694. doi: 10.1021/ac901877z.
  • 37. Bendat JS, Piersol AG. Random Data: Analysis and Measurement Procedures. 4th ed. Hoboken, N.J.: Wiley; 2010.
  • 38. Chang S, Huang S, Liu H, Zhang P, Liang F, Akahori R, Li S, Gyarfas B, Shumway J, Ashcroft B, He J, Lindsay S. Nanotechnology. 2012;23:235101. doi: 10.1088/0957-4484/23/23/235101.
  • 39. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. Wiley-Interscience; 2000.
  • 40. Press WH. Numerical Recipes in Fortran 77: The Art of Scientific Computing. 2nd ed. Cambridge, England; New York: Cambridge University Press; 1992.
  • 41. Lindsay S, He J, Sankey O, Hapala P, Jelinek P, Zhang P, Chang S, Huang S. Nanotechnology. 2010;21:262001. doi: 10.1088/0957-4484/21/26/262001.
  • 42. Ahmed T, Haraldsen JT, Rehr JJ, Di Ventra M, Schuller I, Balatsky AV. Nanotechnology. 2014;25:125705. doi: 10.1088/0957-4484/25/12/125705.
