Abstract
Nanopores are versatile single-molecule sensors that are being used to sense increasingly complex mixtures of structured molecules, with applications in molecular data storage and disease biomarker detection. However, increased molecular complexity presents additional challenges to the analysis of nanopore data, including more translocation events being rejected for not matching an expected signal structure and a greater risk of selection bias entering this event curation process. To highlight these challenges, here we present the analysis of a model molecular system consisting of a nanostructured DNA molecule attached to a linear DNA carrier. We make use of recent advances in the event segmentation capabilities of Nanolyzer, a graphical analysis tool provided for nanopore event fitting, and describe approaches to event substructure analysis. In the process, we identify and discuss important sources of selection bias that emerge in the analysis of this molecular system and consider the complicating effects of molecular conformation and variable experimental conditions (e.g. pore diameter). We then present additional refinements to existing analysis techniques, allowing for improved separation of multiplexed samples, fewer translocation events rejected as false negatives, and a wider range of experimental conditions for which accurate molecular information can be extracted. Increasing the coverage of analyzed events within nanopore data is not only important for characterizing complex molecular samples with high fidelity, but is also becoming essential to the generation of accurate, unbiased training data as machine learning approaches to data analysis and event identification continue to increase in prevalence.
Solid state nanopores are a promising detection platform for a range of single-molecule biosensing applications.1 In their basic form, they consist of a single nanoscale hole through a thin dielectric membrane separating two electrolyte-filled reservoirs. When an external driving force (such as a voltage or pressure gradient) is applied across the membrane, a flow of ions through the pore is produced that can be measured as ionic current.2 Passage of a single molecule of interest through the pore can then be detected as a transient blockage in this ionic current, with the magnitude of the deviation from the open pore baseline at any moment relating to the local properties (size, shape, charge, etc.) of the part of the molecule that is in or near the pore at that time.3 In this way, the presence of specific biomolecules (e.g. nucleic acids, proteins) in a sample can be deduced from the appearance of their signatures in the electrical signal.1,4
Typically, to analyze nanopore data, a (small) set of descriptive statistics is extracted from fits to the ionic current signal of each translocation event (such as the total event duration and the maximum blockage depth) and the values of these parameters are plotted for all events in each experiment. Sub-populations of events are then assigned to target molecules based on clustering of these statistics. In recent years a handful of analysis tools5–8 (each with their own unique limitations on event fitting) have been proposed to facilitate such processing of raw nanopore data at a level that can suit most analysis needs. In practice, despite the availability of analysis tools that support deeper functionalities such as OpenNanopore,6 MOSAIC,5 Transalyzer,7 AutoNanopore,8 and others, analyte characterization is often limited to basic, single-valued metrics of their translocation signals and ignores information encoded in the internal structure of the events.
Distinguishing molecules with similar physical and chemical characteristics under this scheme can be challenging, however, as significant overlap can exist in the distributions of their extracted parameters. For example, the stochasticity in the transport of DNA can limit the ability of the nanopore to accurately separate DNA fragment sizes by passage time9–13 and, likewise, the different possible conformations molecules can take during translocation can limit the ability of the nanopore to identify molecules based on shape.14–16 For this reason, structured DNA polymers have been used to create barcode-like signals or more unique electrical signatures to improve the capability of the nanopore to distinguish molecules in a mixture or to multiplex detection during sensing. Applications in molecular data storage17–21 and target biomarker detection,22–27 for instance, have begun to use these structured polymers to encode nanopore-readable information about a sample.
When working with structured molecules of increasing complexity, a simple analysis workflow based on whole-event metrics (e.g. total event passage time or maximum blockage level) may not be sufficient to fully capture the information content of the signal, which is instead contained in localized features of each event (hereafter referred to as “event substructure”) arising from specific local details of the molecule being sensed serially during passage through the pore. Furthermore, when selecting the subset of events that satisfy the expected event substructure associated with a given target molecule, many events (often >50%19) must be discarded as a result of conformational effects,28 fitting errors,5,29 or bandwidth limitations.30 This necessitates the use of longer sensing times and/or higher target concentrations in order to generate sufficient statistics for each structure variant in a given sample.
Focusing on a narrow window of total events in this way may also introduce undesirable biases into downstream analysis since there are often strong statistical correlations between the types of events that are discarded. This can significantly alter the statistical distributions of key parameters of interest from the ground truth when considering only the subset of events that pass these filters. For example, a common source of error in nanopore measurements is bandwidth limitation, which results in the distortion of events/subevents with short duration and translates to a higher probability of failed fits as sublevel durations are reduced.5 Removing badly fitted events in this case will cause any downstream analysis of the passage times of the molecules (or substructures of the molecules), for instance, to only use events with long sublevel durations, potentially leading to erroneous conclusions as this bias propagates into the overall distributions of these quantities.
Recently, the field of nanopore analysis has seen a significant increase in the use of machine learning models for event recognition.31–35 While these developments are exciting, the challenges involved in assembling appropriate training datasets are not often considered and are critical to sound model design. Most importantly, as machine learning approaches to data analysis and event identification become more common,31 any bias that exists in selecting events as the training data for the machine learning algorithms may ultimately be encoded in the model itself, propagating bias-induced errors forward anywhere that model is used. It is therefore critical that tools for event segmentation into model training categories be developed to assist in avoiding these biases at the level of training data curation.
To illustrate these analysis challenges when working with complex molecules, as well as to develop strategies and tools to address them, in this work we consider translocation events from a basic system of structured DNA molecules. The system is composed of two parts: a long, double-stranded piece of linear DNA of variable length (“carrier”) and a relatively short, bulky DNA nanostructure of variable size (“cargo”) that is reversibly attached to one end of the linear DNA through the hybridization of complementary DNA sequences. Here we will apply different analysis frameworks to segment our translocation events by both the carrier length and the type of attached nanostructure, and carefully identify and discuss errors that arise in this dataset curation process and how they can be mitigated.
As a practical application, such a structured molecular system could be used to translate a biomarker of interest into a more specific nanopore signal, for instance using toehold-mediated strand displacement reactions.36,37 In this scheme, a DNA oligonucleotide (“displacement strand”) that acts as a label for a particular biomarker is generated by an upstream assay step19,20,24,38,39 and then releases a cargo molecule from its attached carrier by binding to a complementary single-stranded extension (“toehold”) on one of the two pieces. If separate carrier length / cargo species pairs were each associated with a unique displacement strand sequence, then multiple biomarkers could be detected simultaneously in a single nanopore experiment, provided that the translocation events from these separate carrier-cargo pairs were distinguishable by their current signatures.
Importantly, the translocation events of these molecules contain essential features shared by many of those from other structured nanopore targets: all of the identifying information is encoded in the event substructure, in this case, local current spikes generated by the bulky nanostructure on top of an extended underlying signal from the DNA carrier, with variability in this signal arising from the multiple possible orientations/conformations the molecule can adopt prior to entering the pore. As such, this model system is representative of many nanopore sensing schemes currently in use in the field, and the analysis challenges we discuss are readily applicable to other, similar systems.19,27,40
In this paper we describe a process to analyze the nanopore signals of structured polymers and introduce refinements of this basic approach to better realize our goal of separating events from mixed samples. This process is centred around the use of Nanolyzer, a graphical analysis tool provided by Northern Nanopore Instruments for nanopore event fitting, segmentation, and event substructure analysis. We note, however, that the general approach to complex event analysis outlined in this work is compatible with any alternative software packages that support the core functions of: fitting event substructure, adding labels to subevents identified by the fits, and segmenting populations of events based on this labelled substructure, and therefore aims to be broadly integrable with existing experimental tools and applications in the field.
Initially, we sort events from a dataset with a single type of carrier-cargo molecule into categories that reflect the conformation of their parent molecules as they passed through the pore, using database query tools to segment events based on the fits. In the process, we discuss what additional information can be gained by breaking down subpopulations by the event substructure, as well as how variation in molecular conformation can impact both the spread of the total population in parameter-space and the rate of generating false negatives that are rejected during analysis due to fitting errors or unrecognized signal shapes. Following on this, we develop refinements of our analysis frameworks that allow for 1) improved separation of event populations by their carrier length and attachment type, 2) fewer events rejected as false negatives / less introduced selection bias, and 3) a wider range of compatible experimental conditions used to generate data with these molecules. Datasets from mixed samples featuring multiple carrier lengths or attachment types are used to compare the results of analyses performed with and without these analysis refinements in place.
RESULTS and DISCUSSION
Figure 1a shows a schematic of the molecular system used here. It consists of a variable length of linear double-stranded DNA (2-7 kbp, referred to as “carrier”) attached to a DNA multi-way junction (stars with 6 or 12 arms, 24-bp arm length, referred to as “cargo”) via complementary single-stranded extensions (12 nt) on both pieces – see Methods section for synthesis details. This cargo has been well characterized in the past as a nanopore sensing target – mixes of stars with differing numbers of arms run on the same pore have previously been separated by the depth of the current blockages they generate.14,20 A typical current trace of translocating carrier + cargo molecules is presented in Figure 1b, as well as a magnified view of the current signature from the passage of a single molecule.
Figure 1:

a) Overview of DNA structures in “carrier” and “cargo” system. Illustrated here as a cargo example is a 6-arm star junction (24-bp arms) with a 12-nt ssDNA extension that can be hybridized to its complementary extension on a linear carrier molecule. b) Representative current trace of a sample of DNA carriers + cargo translocating through a nanopore under an applied electric field (2-kbp carrier DNA + 12-arm stars, ~11-nm pore, 3.6 M LiCl, 175 mV). Inset: Zoomed-in view of an individual event. c) 2D histogram of the maximum deviation from the baseline current for each event vs. its translocation time (log scale), using the same dataset as in (b). Population i: free cargos (12-arm DNA stars). Population ii: free carriers (linear 2 kbp dsDNA). Population iii: hybridized cargo-carrier molecules.
Once the ionic current data is recorded as a continuous timeseries, the next step is to use Nanolyzer to detect the translocation events within this data as statistically significant departures from the open-pore baseline current, as well as to extract numerical values of key descriptive statistics (“metadata”) for each event. For instance, Figure 1c shows a 2D histogram over two such statistics (using the analyzed events from the same dataset as Figure 1b): “dwell time” and “maximum deviation”. Dwell time is simply the difference between the start and end times of the fitted event (signalled by the departure from, and return to, the baseline current, respectively) as determined by the event-fitting algorithm. It characterizes the duration of an interaction between an analyte molecule and the nanopore. Maximum deviation corresponds to the single data point within the event that is furthest away from the open-pore baseline current and is calculated as the absolute difference between that point and the average value of the fitted baseline immediately preceding/following the event. For data featuring very short-lived current blockages within events (such as those due to translocations of the small cargo molecules in our case, see sample event in Figure 1b), max deviation can often be a more useful measure of the magnitude of these blockages than approaches that attempt to fit these fast current spikes to step functions, since this fitting tends to fail with increased frequency as the blockage durations get shorter5 (see also the discussion surrounding Figure 5, below).
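The two statistics defined above can be illustrated with a minimal sketch. It is not Nanolyzer's implementation: the function name, the `n_baseline` window, and the assumption that an event finder has already supplied the start and end sample indices are all hypothetical choices made for illustration.

```python
from statistics import mean


def event_metadata(current_pA, sample_rate_hz, start_idx, end_idx, n_baseline=5):
    """Return (dwell time in µs, max deviation in pA) for one event.

    start_idx is the first sample inside the event and end_idx the first
    sample after it; both are assumed to come from a separate event finder.
    n_baseline is an illustrative number of samples averaged on each side.
    """
    # Dwell time: difference between the event end and start times.
    dwell_us = (end_idx - start_idx) * 1e6 / sample_rate_hz

    # Local baseline: average of the samples just before and just after.
    before = current_pA[max(0, start_idx - n_baseline):start_idx]
    after = current_pA[end_idx:end_idx + n_baseline]
    baseline = mean(before + after)

    # Max deviation: the single point furthest from that local baseline.
    max_dev = max(abs(x - baseline) for x in current_pA[start_idx:end_idx])
    return dwell_us, max_dev


# Synthetic trace sampled at 1 MHz: a shallow carrier-like blockage
# containing one deep, single-sample spike.
trace = [1000.0] * 5 + [700.0, 690.0, 300.0, 710.0] + [1000.0] * 5
dwell_us, max_dev = event_metadata(trace, 1_000_000, 5, 9)
```

Because max deviation depends on a single raw data point rather than on a fitted step, it remains defined even for spikes too short to fit, at the cost of being sensitive to noise.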
Figure 5:

Gallery of several events from the dataset of Figure 4 that failed to be sorted into the categories of Figure 2 (“Rejected” row in Table 1) illustrating some of the reasons for this assignment. a) Cargo subevent missed. b) Cargo subevent fitted with multiple levels. c) Cargo subevent fitted with shallow level that will not be separable from folded carrier level. d) One of carrier levels not assigned into a carrier label (e.g. slightly deeper/shorter/etc. than bulk of other like sublevels). e) Event features short fitted initial level, perhaps due to access resistance blockage. f) One of carrier subevents features extra sublevel (arising perhaps from noise in signal or additional structure in physical conformation, e.g. knot/kink in backbone). g) Events lacking obvious cargo subevent (likely Type 0 events that were not removed in the initial max deviation threshold due to noise that exceeded this threshold in their deepest/folded subevent). h) Events that are not recognizable as resulting from simple conformations of our molecule of interest (may be due to contamination in the DNA sample, aggregates of multiple molecules, highly contorted individual molecules, etc.) Broadly speaking, (a) – (f) represent false negatives while (g) – (h) are examples of true negatives.
From the distributions in dwell time and max deviation of Figure 1c, at least three distinct populations of events are visible: i) short-lived, deep events (top left), likely due to the passage of cargo molecules (i.e. short but bulky star junctions) that are unattached to carriers, ii) longer events with two shallower blockage levels (bottom) likely corresponding to folded and unfolded conformations of free carrier molecules, and iii) events that are both long-lived and deep (top right) from the fully-assembled carrier-cargo molecule.
Rigid Event Classification
A proposed categorization scheme for the structured ionic current signals generated from hybridized carrier-cargo targets passing through a nanopore is presented in Figure 2. It is based on simultaneously grouping the events on two criteria that characterize the conformational state of the molecule as it translocates: 1) the folding state of the dsDNA carrier and 2) in what order the two ends of the carrier (free end and target-attached end) enter the pore. In this scheme, the carrier can be unfolded (“Type U”), partially folded (“Type P”) or fully folded (“Type F”), while the attachment can enter the pore first (“Type 1”), second (“Type 2”) or not at all (“Type 0”). Type 0 events are expected when a sample contains at least some carrier molecules with two free ends, that is, without an attached cargo. More complex event shapes (e.g. involving a doubly-folded or knotted carrier translocating the pore) have not been included in this categorization scheme and thus any events characterized by such shapes will be “rejected” by filters implementing it. However, based on previous results, simple DNA conformations featuring at most an initial single fold should represent the majority of translocation events under the experimental conditions tested here12,41,42 (double-stranded DNA, pore diameters < 20 nm), and no significant bias is expected to be introduced by ignoring events with more than one carrier fold.
Figure 2:

Schematic of an event categorization scheme for nanopore translocations of a bulky cargo attached to a linear carrier. Ionic current events are grouped as entries of a 2D matrix related to the underlying conformation of the translocating molecules that generated them (see main text for a description of the two criteria that create the axes of this grid). A visualization of the sublevel sequence of each event type is presented, with the longer, shallower levels representing fitted states of the linear carrier as it passes through the pore and the shortest, deepest levels representing rapid translocations of the bulkier cargo. Insets for each event type show an illustration of the molecular conformation right before translocation through the pore (bottom right), as well as a representative current trace from the dataset of Table 1 (top right) that was sorted into the given category with the sublevel labelling approach outlined in Figure 3.
The process for performing this categorization on a dataset of nanopore events is outlined in Figure 3. It starts with passing the raw or digitally low-pass filtered nanopore data to Nanolyzer to first locate the events within the ionic current timeseries and then to fit sublevels to the event substructure (relating to different possible physical substructures or configurations of the translocating molecule), based on criteria that can be adjusted in the software settings. The full configuration details of the analysis are provided in Supporting Information Section S1. The information from the output of this step on a single event is illustrated in Figure 3a and consists of sublevel duration, blockage depth, and time ordering, as well as approximately 30 additional metadata (see Supporting Information Section S2 for a full list of metadata used in this work).
Figure 3:

Outline of the process to assign labels to event sublevels, using an example dataset of a 12-arm star cargo attached to a 2-kbp dsDNA carrier and passed through a ~11-nm pore (3.6 M LiCl, 175 mV). a) Sample event showing five fitted sublevels (including the two at the beginning and end of the event, respectively, associated with the open baseline). b) Scatterplot of blockage depth and dwell time for all fitted sublevels in the dataset (across all events), clustered into four groups with Nanolyzer. Group 1 will be associated with unfolded sublevels in the carrier, Group 2 with folded carrier sublevels, and Group 3 (short, deep sublevels) with blockages from the attached cargo. Group 0, which includes short, shallow fitted noise spikes, 3× multiples of the unfolded level, etc. will not feature in the analysis of this sample and so events containing these sublevels will be discarded by a rigid event sorting approach. c) Original event in (a) with newly-assigned sublevel labels superimposed.
From there, all the sublevels found in the first step (across all translocation events in the dataset) can be grouped into clusters based on any combination of these metadata using the clustering module in Nanolyzer. For the purposes of this work, we use sublevel duration and sublevel blockage. The result is a numerical label assigned to each fitted sublevel that maps it to the cluster to which it was assigned, as seen in the example scatterplot in Figure 3b. This associates each event with an ordered series of sublevel labels that encode the event substructure.
Ideally, for the types of ionic current signatures being investigated here (generated by carrier-cargo molecules), separate clusters will emerge from this process corresponding to: 1) unfolded carrier blockages, 2) folded carrier blockages, and 3) cargo blockages. The carrier blockages, corresponding to folded and unfolded segments of linear dsDNA, can have a broad spread in sublevel durations (≳ 1 order of magnitude)10,42,43 and are typically observed with current blockage ratios ~2:1 when comparing folded (two DNA fragments in pore) and unfolded (one DNA fragment) blockages. Conversely, the DNA nanostructures comprising the cargo attachments are both bulkier (6+ dsDNA fragments) and shorter (~100×) than the carriers and so are expected to result in sublevels that are substantially smaller in duration and larger in blockage magnitude (see Figure 3b).
Once these cluster labels are defined, database filters based on SQL queries (“selection filters”) are written in Nanolyzer to extract only the subset of events that follow a specific sequence of sublevel labels. For instance, Type 2P events feature a sublevel sequence of ‘folded carrier’ → ‘unfolded carrier’ → ‘cargo’. This translates to a label sequence of ‘2’ → ‘1’ → ‘3’ in the example of Figure 3 (see Figure 3c). Selection filters corresponding to each of the event types of Figure 2 are thus written based on their associated sequence of sublevel labels and are applied to the dataset to separate events into categories to be analyzed separately. A full list of selection filters is provided in the Supporting Information Section S3.
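The logic of such a rigid selection filter can be sketched as a lookup on the ordered label sequence of each event. This is an illustrative stand-in for the SQL-based filters, not the queries used in this work. Only the Type 2P sequence (‘2’ → ‘1’ → ‘3’) is stated explicitly in the text; the other sequences below are assumptions inferred from the conformations described for Figure 2, and the event IDs are made up.

```python
from collections import Counter

# Cluster labels as in Figure 3:
# 1 = unfolded carrier, 2 = folded carrier, 3 = cargo, 0 = everything else.
EVENT_TYPES = {
    (2, 1, 3): "2P",    # stated in the text: folded -> unfolded -> cargo
    (2, 3, 1): "1P",    # assumed: cargo directly preceded by folded carrier
    (1, 3): "2U",       # assumed: unfolded carrier, cargo enters last
    (3, 1): "1U",       # assumed: cargo enters first, then unfolded carrier
    (2, 3): "1/2F",     # assumed: fully folded carrier, then cargo
}


def classify(labels):
    """Map an ordered sublevel-label sequence (baseline levels excluded)
    to an event type; any sequence not listed exactly is rejected."""
    return EVENT_TYPES.get(tuple(labels), "Rejected")


# Hypothetical events keyed by ID, each an ordered list of sublevel labels.
events = {101: [2, 1, 3], 102: [3, 1], 103: [2, 1, 0, 3], 104: [1, 3]}
tally = Counter(classify(seq) for seq in events.values())
```

Event 103 is rejected solely because one extra sublevel fell into Group 0, which illustrates how exact sequence matching can produce false negatives.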
We performed this categorization process on a large dataset (>10,000 events) generated from passing 12-arm star nanostructures attached to a 2-kbp dsDNA carrier through a ~13-nm pore (3.6 M LiCl, 75 mV) with a measurement bandwidth of 1 MHz (sampled at 4.17 MHz) and digitally low-pass filtered for fitting at 250 kHz. As an initial step, a selection filter was applied on the fitted events to remove those without a “large” max deviation (>800 pA), so that only events featuring an attached star were considered (i.e. removing Type 0 events as labelled in Figure 2, see Supporting Information Section S4). The remaining events were then sorted as outlined in Figure 3, with the resulting statistics presented in Table 1. The majority (57.7%) of total events were successfully binned into the categories of Figure 2, and manually inspecting the current traces of a sample of events in each category confirmed that this assignment was accurate (i.e. had a low false positive rate) (see Supporting Information Section S5 for the first ten events from each category). This also implies that 42.3% of deep-blockage events did not match an expected event substructure (“Rejected” entry of Table 1), and while this is a significant fraction, it is regrettably typical for experiments involving structured molecules (i.e. rigid selection filters can have a high false negative rate). A later section, “Rejected Events and Induced Biases”, will examine this subset of events in more detail.
Table 1:
Statistics of analyzed translocations grouped by event type for the dataset of Figure 4 (12-arm stars bound to 2-kbp dsDNA carrier, ~13 nm pore, 75 mV, 3.6 M LiCl). The tabulated entries of dwell time, ECD, and max deviation for each event type represent the peak values of Gaussian fits to histograms of these quantities, while the range above and below these peak values is taken from the standard deviation of the fits. Since the widely-spread dwell time and ECD distributions are fitted to log-normal functions, their range extends slightly farther for positive deviations than for negative deviations when mapped back to a linear scale (see Figure 4). Recall that Type U refers to unfolded carriers, Type P to partially folded, Type F to fully folded (i.e. capture in the middle of the carrier), while Type 1 refers to the attached cargo entering the pore first, and Type 2 after the carrier. Here Type 0 events (missing cargo) are filtered out.
| Event Category | Count | Percentage (%) | Dwell Time (μs) | ECD (fC) | Max Deviation (nA) |
|---|---|---|---|---|---|
| Type 1U | 1436 | 14.6 | | | 1.27 ± 0.43 |
| Type 1P | 1432 | 14.5 | | | 1.54 ± 0.44 |
| Type 2U | 1379 | 14.0 | | | 1.24 ± 0.37 |
| Type 2P | 1233 | 12.5 | | | 1.26 ± 0.44 |
| Type 1/2 F | 209 | 2.1 | | | 1.41 ± 0.55 |
| Rejected | 4169 | 42.3 | | | |
| Total | 9858 | 100 | | | |
Interestingly, there is a roughly equal distribution of the accepted events into each of the five categories, with the exception of Type 1/2F events which are sparsely distributed. This latter fact is perhaps to be expected, as it is unlikely that folded events will be folded exactly in the middle of the molecule within the time resolution of an experiment. In that sense, Type 1/2F events can be thought of as a mix of Type 1P and Type 2P events with exceedingly short, unfolded subevents that in the limit of infinite resolution could be properly sorted amongst those two categories. The roughly equal proportions of the remaining event types (Types 1U, 1P, 2U, and 2P) imply that there was no strong preference for these molecules to enter the pore in an unfolded or folded state, or with the star-end captured before the free end (or vice versa). Such trends are expected to depend strongly on the exact conditions of any particular experiment, however. For instance, shorter DNA carriers or smaller pores may reduce the degree of folding (promote Type U events over Type P) as the number of persistence lengths per molecule decreases or as the physical volume to accommodate multiple molecular fragments in the pore is reduced.13 Similarly, particularly bulky attached cargos (relative to the size of the pore) may have to be forced through by prior capture of the linear carrier, imparting a preference for Type 2 events over Type 1 events.25
Classifying by Dwell Time
Figure 4 shows how events from each subset are distributed over key metadata that will be relevant to our eventual goal of sorting mixtures of carrier lengths and attachment species by their current signatures. A scatter plot of max deviation over log[dwell time] for all events (colour-coded by event category) is presented in Figure 4a. From the dwell time distributions (Figure 4b), we see that fully unfolded events (Type U) pass through the pore in the longest amount of time on average, followed by partially folded events (Type P) and finally fully folded events (Type F). This makes sense if the impact of folding state on the electrophoretic velocity (e.g. due to interactions between the pore walls and the DNA as it passes through) is minimal here. Indeed, the more folded a molecule is, the shorter the distance between the segments of the molecule that first enter and last leave the pore, up to a factor of 2 when comparing fully folded and fully unfolded molecules, and we observed a factor of ~1.9 between the peak dwell times of Type U and Type F events of the current dataset (see Table 1).
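The factor-of-2 bound can be made explicit with a simple constant-velocity picture. This is an idealized sketch rather than a model fitted to the data: the symbols L, x, and v are introduced here for illustration, the speed is assumed to be independent of folding state, and the short cargo is neglected.

```latex
% L: contour length of the carrier
% v: translocation speed, assumed constant and independent of folding state
% x: distance from the fold (capture point) to the nearer carrier end,
%    with 0 <= x <= L/2; the longer arm, of length L - x, sets the duration
\begin{align}
t_{\mathrm{dwell}}(x) &= \frac{L - x}{v}, \\
t_{\mathrm{U}} &= t_{\mathrm{dwell}}(0) = \frac{L}{v}, \qquad
t_{\mathrm{F}} = t_{\mathrm{dwell}}(L/2) = \frac{L}{2v}, \\
\frac{t_{\mathrm{U}}}{t_{\mathrm{F}}} &= 2 .
\end{align}
```

Under these assumptions, partially folded molecules fall between the two limits, and the observed ratio of ~1.9 sits close to the upper bound.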
Figure 4:

Distributions of extracted translocation parameters grouped by event type (see Figure 2 for definitions) for the dataset of Table 1 (12-arm star cargo bound to 2-kbp dsDNA carrier, ~13 nm pore, 75 mV, 3.6 M LiCl). a) Scatter plot of max deviation vs. log[dwell time] for all events, colour-coded by subtype (see legend overhead). b) Histograms of log[dwell time] for each subtype. c) Histograms of log[ECD] for each subtype. d) Histograms of max deviation for each subtype.
Additionally, within a given folding state of the carrier, the two subtypes (Types 1 and 2) – corresponding to the sequence of molecule ends that enter the pore – remain consistent in dwell time (see Table 1). Together with the tight correlation of carrier folding state and peak dwell time, this is another indication that the relative sizes of the pore and the target molecule used here were in a regime where interactions between the pore walls and passing DNA were negligible. For instance, comparing the shapes of Type 1P and Type 2P events in Figure 2, we see that the cargo sublevel is directly preceded by a folded carrier sublevel in the case of Type 1P events and an unfolded carrier sublevel for Type 2P events. This means that there is an additional fragment of dsDNA in the pore during the translocation of the (already bulky) 12-arm star cargo of Type 1P events vs. those of Type 2P events. However, this is not associated with a significant difference in dwell times between the two subpopulations that might result from extra friction during Type 1P translocations (see the discussion surrounding Figure 7, below, for examples with smaller pore sizes where pore wall/cargo interactions cannot be ignored).
Figure 7:

Using simple selection filters based on fitted carrier sublevels and the “local max deviation” of individual sublevels exceeding thresholds to extract Type 2 events from a complex dataset (6- or 12-arm star cargos attached to 2-kbp carriers, ~8-nm pore, 3.6 M LiCl, 200 mV). a) Schematic of the action of “Filter A” (see Table 2 for definition) on a typical event captured with this filter. Illustrated are the fitted sublevels of the event, the current threshold for the local max deviation to cross, and the sublevels over which this threshold is active. b) Schematic of the action of “Filter B” (see Table 2 for definition) on a typical event captured with this filter. c) 2D histogram of max deviation vs. log[ECD] for the output of Filter A. d) 2D histogram of max deviation vs. log[ECD] for the output of Filter B. e) 2D histogram of max deviation vs. log[ECD] for the output of the union of Filter A and Filter B. f) 2D histogram of max deviation vs. log[TrECD] for the output of the union of Filter A and Filter B. (See the discussion surrounding Figure 6 for a definition of “truncated ECD”).
One issue with the wide distribution of dwell times observed above (due at least partly to variation in the folding state of the carrier) is that it limits the resolving power of this parameter in our original application of separating target molecules on carrier length. For instance, we would expect a carrier molecule that was 1.9× longer in length to, on average, take a longer time to pass through the pore by roughly that factor. But as we saw with the data of Table 1/Figure 4, this effect might be exactly undone if comparing fully folded instances of the longer molecule with unfolded instances of the shorter molecule (compounded further by the fact that a greater proportion of folding is typically observed with longer molecules12). One solution might be to discard all of the folded events (~50% of events for the selected data above, and ~29% of the entire dataset) in each population but 1) this may not be practical for already small datasets and 2) the proportion of folded events is again not expected to remain fixed across carrier lengths – the data from longer carriers in a sample will likely be depleted to a greater extent. Rejecting folded events could therefore severely limit our ability to compare experiments and extract meaningful conclusions when using a variety of molecular lengths.
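The two folded-event percentages quoted above can be reproduced from the counts in Table 1. The short calculation below assumes that “folded” covers Types 1P, 2P, and 1/2F; the variable names are illustrative.

```python
# Event counts taken from Table 1.
counts = {
    "1U": 1436, "1P": 1432, "2U": 1379, "2P": 1233,
    "1/2F": 209, "Rejected": 4169,
}

total = sum(counts.values())              # all deep-blockage events
accepted = total - counts["Rejected"]     # events matching a Figure 2 category

# Assumption: every partially or fully folded category counts as "folded".
folded = counts["1P"] + counts["2P"] + counts["1/2F"]

folded_of_accepted = 100 * folded / accepted  # share of the selected data
folded_of_total = 100 * folded / total        # share of the entire dataset

print(f"{folded_of_accepted:.1f}% of accepted, {folded_of_total:.1f}% of all events")
```

Discarding folded events would therefore remove about half of the events that were successfully categorized, on top of the 42.3% already rejected.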
Classifying by Equivalent Charge Deficit
A potentially more flexible solution is to instead bin the events on “equivalent charge deficit” (ECD). Equivalent charge deficit is the integrated area under each event in a current vs. time plot, in between the blocked current values comprising the event and the open-pore baseline (Figure 6a later illustrates a sample calculation). It is the deficit of electric charge that would have otherwise passed through the pore as ionic current had the translocating molecule not blocked it. Folded segments of linear molecules thus contribute more to an ECD calculation compared to an equivalent unfolded segment through their higher current blockages but also less through their shorter dwell times (assuming similar passage speeds). If ECD is conserved for molecules of the same length but different folding states, the two effects perfectly cancel, within statistical variation.11,13,44
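The definition above can be sketched numerically. The following is a minimal illustration only, not the Nanolyzer implementation: the function name, the sign convention (current drops below the baseline during a blockage), and the idealized rectangular events are all assumptions made for the example. It shows how a segment that blocks twice the current for half the time yields the same ECD as its unfolded equivalent.

```python
import numpy as np

def equivalent_charge_deficit(current_pA, baseline_pA, dt_us):
    """Sum the blocked current (baseline minus signal) over one event.

    Hypothetical helper: assumes uniformly sampled current in pA and a
    sampling interval in microseconds. 1 pA x 1 us = 1e-3 fC, so the
    result is returned in fC.
    """
    blockage_pA = baseline_pA - np.asarray(current_pA, dtype=float)
    return float(np.sum(blockage_pA) * dt_us / 1000.0)

# Illustrative (not measured) rectangular events on a 10 nA baseline.
baseline = 10000.0
unfolded = np.full(100, baseline - 1000.0)  # 1000 pA blockage for 100 us
folded = np.full(50, baseline - 2000.0)     # 2000 pA blockage for 50 us

ecd_unfolded = equivalent_charge_deficit(unfolded, baseline, dt_us=1.0)
ecd_folded = equivalent_charge_deficit(folded, baseline, dt_us=1.0)
```

In this idealized case the dwell times differ by a factor of two while the two ECD values coincide, which is the cancellation described in the text.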
Figure 6:

Comparison of total equivalent charge deficit (ECD) and an optimized version of this quantity, truncated ECD (“TrECD”), for classifying populations of DNA carrier + cargo translocation events by carrier length. a) Illustration of an ECD calculation for a sample translocation event (2 kbp carrier + 12-arm star cargo, ~8-nm pore, 3.6 M LiCl, 200 mV), visualized as the shaded area between the data points comprising the blockages and the fitted baseline (open pore) current. b) Illustration of a TrECD calculation for the same event as in (a), incorporating only the sublevels that were associated to a blockage from the carrier during a clustering step (see Figure 3). Note that these sublevels enter the TrECD area calculation only through their single-valued, fitted current blockages (multiplied by their durations in time), rather than by the distance of every raw current point to the baseline as in (a). c) 2D histogram of max deviation vs. log[ECD] for translocation events from 12-arm star junctions attached to a mixture of carrier lengths (2- & 7-kbp, ~11-nm pore, 3.6 M LiCl, 200 mV). d) 2D histogram of max deviation vs. log[TrECD] for the same events as in (c). e) 2D histogram of max deviation vs. log[ECD] for translocation events from 2-kbp carriers attached to a mixture of star types (6- & 12-arm, ~8-nm pore, 3.6 M LiCl, 200 mV). f) 2D histogram of max deviation vs. log[TrECD] for the same events as in (e).
Figure 4c shows the ECD distribution for each subset of Table 1. The events from different subsets, corresponding to all the possible carrier folding states, are much more consistent in ECD, collapsing the separate dwell time distributions in Figure 4b to a shared ECD distribution in Figure 4c (compare also the two corresponding columns in Table 1). Now this collection of translocation events resulting from molecules of a single carrier length can be characterized by a single peak value relating to this length, making comparisons to other carrier lengths much easier. One thing to note is that this common peak ECD value for Types 1 and 2 events in the sample is slightly larger than that of Type 0 events in the same sample (analyzed separately, e.g. fC vs. fC for folded events with and without stars, respectively, see Supporting Information Section S6), which is related to the deep extra blockage present in Types 1 and 2 events from the attached cargo. The presence of a cargo can therefore influence the ECD value of a translocation event, which may impact the accuracy of this classification scheme. An approach to removing the influence of the cargo on carrier-specific length metrics will be discussed in a later section “Classifying by Truncated ECD”, below.
Classifying by Max Deviation
Finally, Figure 4d shows the distribution in maximum deviation for each subset (see also the corresponding column in Table 1). As a reminder, the data point corresponding to the max deviation in each event is expected to appear during the subevent associated with the translocation of the bulky attached cargo (assuming it is present, see Figure 2) and so indirectly characterizes the size of the attachment through how much current is blocked. Interestingly, three of the five event types (Types 1U, 2U, and 2P) share an approximately equal max deviation distribution, while that of Type 1P events is noticeably shifted toward deeper blockages. This can be taken as evidence of the extra carrier fragment alluded to earlier that is present in the pore during the passage of the cargo, blocking extra current. Type 1/2F events, which can be seen as a mixture of Type 1P and Type 2P events with unfolded subevents too fast to be resolved, have a peak max deviation that is intermediate between the two other subpopulations (compare the relevant entries for max deviation in Table 1), as might be expected. In the context of our original goal of separating molecular mixtures of carrier-attached cargos of different sizes by their current signatures, we see how the presence of Type 1P and Type 1/2F events could complicate this process by “artificially” broadening distributions that characterize their blockage depths, creating more overlap between distributions of different cargos. An approach of first grouping translocation events by molecular conformation (rather than just considering metrics of the entire global population) therefore has the advantage of separating some of the carrier versus cargo effects and provides a clearer picture of the true magnitude of the attachment blockage.
Rejected Events and Induced Biases
Current traces and their fitted sublevels for a few representative examples of the events that were not binned into the categories of Figure 2 (42.3%) are shown in Figure 5. Manually inspecting a sample of these rejected events, we note that almost all of them fall into rejection modes corresponding to recognizable cargo-carrier events (Figs 5a – 5f), with a majority of these involving unfitted (Fig. 5a) or overfitted (Fig. 5b) cargo sublevels. These are false negatives – easily human-recognizable as representing our molecule of interest, with subevents containing usable information that is currently being discarded with this rigid sorting approach due to fitting artefacts. This contrasts with events as in Figures 5g and 5h, which cannot easily be associated with a particular conformation of our intended target molecule. An important point to note is that if events from each of the categories of Figure 2 differ in their rates of generating false negatives, it will introduce biases into the statistics of the population (such as in Table 1). For instance, as discussed previously, the supposed “cargo” sublevel of Type 1P events (and some of Type 1/2F events) is actually the convolution of subevents from both the cargo and an extra carrier fragment translocating together. If the additional blockage from the carrier fragment results in a deeper or longer-lived “cargo” subevent, it may stand a better chance of being fitted to a sublevel and thus not falling into the rejection mode of Figure 5a when compared to its Type 2P analogue. Alternatively, deeper or longer-lived cargo subevents may be fitted with extra structure more frequently (beyond the single level expected by the selection filters of Figure 2), which is the rejection mode of Figure 5b. 
More generally, even among the events of a single type, if there is a correlation between the value of a parameter and the rate at which its parent event is rejected, the extracted values for the population will be biased in that direction rather than reflecting the entirety of the data. Understanding and limiting these biases is critical for nanopore sensing applications, especially those involving low numbers of total events and/or precisely quantifying the statistics of a parameter of interest (e.g. as in an assay for a low copy number target).
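The effect of such a correlation can be illustrated with a small simulation. This is a toy sketch under stated assumptions: the blockage-depth distribution, the logistic acceptance curve, and all numerical values are hypothetical and are not taken from the datasets analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" cargo blockage depths for one population (pA).
true_depth_pA = rng.normal(loc=3000.0, scale=400.0, size=20000)

# Assumed acceptance probability that rises with depth, mimicking deeper
# subevents being fitted (and therefore retained) more reliably.
p_accept = 1.0 / (1.0 + np.exp(-(true_depth_pA - 3000.0) / 300.0))
accepted = rng.random(true_depth_pA.size) < p_accept

# Difference between the mean of retained events and the true mean.
bias_pA = true_depth_pA[accepted].mean() - true_depth_pA.mean()
retained_fraction = accepted.mean()
```

Because retention is correlated with depth, the mean of the retained events sits above the mean of the full population even though no individual measurement is wrong.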
This issue of overzealous event rejection resulting from a rigid classification approach is expected to only become more problematic when sensing mixtures of target molecules with several cargo sizes. The dataset of Table 1/Figure 4 is one that is particularly amenable to such an analysis approach and is not necessarily representative of all datasets featuring combinations of DNA carriers and cargo molecules. First of all, it is large, containing ≳ 10,000 events for a single carrier length/cargo size pair. In other cases, such as when a pore features a low capture rate, or is rendered inoperable early on from a permanent clog, or is sampling many carrier + cargo pairs each at a reduced concentration, it may not be possible to reject a large fraction of the total events (~42% in the example above) and still be left with statistically valid counts in each category with which to draw conclusions.
More importantly, this dataset results from experimental conditions (pore size, cargo type, applied voltage, RMS noise vs. amplifier bandwidth, etc.) in which clustering sublevels and filtering events on sublevel sequence is successful in its simplest form. For instance, if a nanopore is “too big” relative to a particular cargo, the event substructure resulting from translocations of this attachment on the carrier may be too fast to reliably fit45, or feature attenuated blockage depths that are not easily separable from one another. On the other hand, if a nanopore is “too small”, passing cargos may interact strongly with its walls, resulting in long events with complex substructure that reflects cargo-pore interactions rather than molecular identity, dominating the event metadata46 (e.g. dwell time, ECD) and obscuring the subevent sequence that identifies the carrier conformation – see Supporting Information Section S9 for example current traces of carrier-cargo hybrids passing through a smaller (~8 nm) pore. Moreover, for a sample containing a mixture of cargo molecules, the size of the pore may represent one interaction regime for the smallest of the cargos (e.g. fast, shallow translocations) and a different one for the largest (e.g. slow, complex translocations). This means that if two carrier + cargo species are sensed by the same pore, it is very unlikely that similar rigid event filters applied to the translocation signals will give equal overall rejection rates (or equal weights to each rejection mode) in both cases, resulting in different imparted biases to the data from each species, and making comparison between experimental replicates difficult.
For these reasons, we next explore a refinement of the analysis approach that relies less on (inconsistent) sublevel fitting to the cargo subevents and develop additional event metrics that allow for the accurate characterization of a wider range of molecules with a given nanopore.
Refinements to Cargo-Carrier Event Characterization
As observed in the earlier ECD classification section, it is tempting to choose ECD as a metric to classify target molecules by carrier length, as this quantity is better conserved across folding states of a given carrier compared to dwell time (when carrier-pore wall interactions are negligible). One issue, however, is that ECD is a measure of the entire event, including contributions from the attached cargo molecules in our case. Evidence of this was also observed in that earlier section (discussing the data of Figure 4) where the peak ECD value of Type 1 & 2 events in that dataset was noticeably larger than that of Type 0 events, preventing ECD from being distinctively a measure of the carrier type used there. One consequence of this is that a spread in the magnitude of current blockages of the cargo contributes to a spread in the ECD of the entire event, reducing our ability to resolve two or more populations centred on different ECD values. Figure 6c shows an example of this effect with a 2D histogram (max deviation vs. log[ECD]) of events from a mixture of 2-kbp and 7-kbp carriers with attached 12-arm star cargos (~11-nm pore, 3.6 M LiCl, 200 mV). Two populations are visible at smaller (2 kbp) and larger (7 kbp) ranges of ECD values but are not particularly well separated despite the ~3.5× difference in carrier length of the molecules that produced them. Part of the problem is the correlation between max deviation and ECD for these events. When plotted on this logarithmic scale, the effect is more drastic for the shorter molecules, since the ECD contribution from the cargo represents a larger fraction of the total ECD in that case and so results in a steeper slope of log[ECD] vs. max deviation. Clearly, more carrier-focused metrics of the event signatures are needed to better classify events.
Classifying by Truncated ECD
To counteract the effect of the cargo on the total ECD, a parameter is defined that we call “truncated ECD” (TrECD). TrECD is defined such that only the sublevels that were associated with current blockages from the carrier during the clustering stage (clusters ‘1’ & ‘2’ in the example of Figure 3) enter the ECD calculation. Each such sublevel adds a contribution equal to its fitted blockage level multiplied by its duration (see Figure 6b). Using the fitted levels serves to further insulate this carrier-specific metric from any large but fast current deviations (e.g. from unfitted cargo blockages, extra noise, etc.) that could lead to less consistent calculated values among events from a single carrier length. In this way, Figure 6d includes the same events as in Figure 6c but with its x-axis binned on TrECD instead. The two populations are now better separated in truncated ECD-space – especially the largest max deviation 2-kbp events from the smallest max deviation 7-kbp events – due to the greater consistency in TrECD value across the range of current blockages from the star junction. Two peaks exist in the TrECD distribution at ~134 fC and ~493 fC, separated by a factor of ~3.7×, in reasonable agreement with the ratio of carrier lengths. (See Supporting Information Section S7 for the corresponding 1D histograms on ECD and TrECD).
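As a minimal sketch of this definition (not the Nanolyzer implementation), TrECD can be computed from a list of fitted sublevels once each has a cluster label. The `Sublevel` record, the `"cargo"` label, and the example durations and cargo depths below are hypothetical; only the use of clusters ‘1’ and ‘2’ for the carrier follows the example of Figure 3.

```python
from dataclasses import dataclass

@dataclass
class Sublevel:
    blockage_pA: float  # fitted blockage depth relative to the baseline
    duration_us: float  # duration of the fitted sublevel
    cluster: str        # label from the clustering step (hypothetical field)

# Clusters attributed to the carrier, as in the example of Figure 3.
CARRIER_CLUSTERS = {"1", "2"}

def truncated_ecd(sublevels, carrier_clusters=CARRIER_CLUSTERS):
    """Sum fitted blockage x duration over carrier sublevels only (in fC)."""
    total_pA_us = sum(
        s.blockage_pA * s.duration_us
        for s in sublevels
        if s.cluster in carrier_clusters
    )
    return total_pA_us / 1000.0  # 1 pA x 1 us = 1e-3 fC

# Two illustrative events sharing the same carrier sublevels but with
# different (made-up) cargo blockages.
carrier = [Sublevel(1040.0, 80.0, "1"), Sublevel(2142.0, 20.0, "2")]
event_shallow = carrier + [Sublevel(3000.0, 10.0, "cargo")]
event_deep = carrier + [Sublevel(5000.0, 40.0, "cargo")]

trecd_shallow = truncated_ecd(event_shallow)
trecd_deep = truncated_ecd(event_deep)
```

Both events return the same TrECD because the cargo sublevel never enters the sum, whereas a total ECD computed over all sublevels would differ between them.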
Figure 6e shows another 2D histogram of max deviation vs. log[ECD], this time for a sample with multiple star cargo types (12-arm and 6-arm) attached to the same carrier length of 2 kbp (~8-nm pore, 3.6 M LiCl, 200 mV), and serves to illustrate the shortcomings of ECD as a carrier-focused metric under these circumstances. At least three populations are visible in the plot, from lowest to highest max deviation: i) a compact cluster at the bottom corresponding to folded carriers without an attached DNA star (unfolded Type 0 events have been removed prior to plotting via a threshold on max deviation to reduce clutter), ii) an intermediate population with a narrow spread in ECD but a large spread in max deviation corresponding to carriers attached to 6-arm stars, and iii) events with the largest blockages and a wide spread in ECD corresponding to carriers attached to 12-arm stars. (See Supporting Information Section S8 for a control of 6- and 12-arm star-attached carriers run separately on a single pore). The folded Type 0 events, lacking a star attachment to interfere with an ECD calculation, are presumably peaked around the “true” ECD value of the 2-kbp carrier in these conditions. Above that, the 6-arm star + carrier events start with their lowest max deviations at ECD values consistent with the peak of the Type 0 events, but then slant off toward larger ECDs as max deviation increases, just as observed for the samples in Figure 6c. Finally, the 12-arm star cargo + carrier events have both the largest ECD values and a very broad spread in these values.
The reason for this disparity between molecules with different cargos in the widths of their ECD distributions is that with the smaller nanopore diameter used here (~8 nm), the larger 12-arm stars interact strongly with the pore as they translocate, slowing them down considerably and imparting a significant variance to their dwell times, while the smaller 6-arm stars pass relatively freely (see Supporting Information Section S9 for sample current traces of both cases). Applying truncated ECD to this distribution again (Figure 6f), we now see both populations aligned above the peak value of the Type 0 events. Despite the chaotic cargo subevents that appear in the current signals of the translocating 2-kbp + 12-arm star hybrids, TrECD removes their influence on each event by looking only at the well-behaved, easily-fitted sublevels of the carrier (while relying on the presence/magnitude of the star subevent to further classify them). Comparing Figures 6e and 6f, in the latter case there now exists clear room to the “right” of these 2-kbp populations in TrECD-space where events from longer carrier lengths could reside and be well separated.
Extracting Events with Relaxed Selection Filters
Another step of the analysis process with room for improvement is in isolating the subsets of events that are the most useful in identifying the carrier lengths/cargo types of a sample. As examined in earlier sections, an approach to this that is based on recognizing an exact sequence of expected sublevels is optimized only for a narrow range of experimental conditions such as pore size: large pores (relative to the cargo dimensions) lead to short, poorly-fitted cargo subevents, while small pores can produce long, complex subevents that generate a variable number of fitted sublevels. Even with a molecular sample featuring a single cargo type that is well matched in size to the pore diameter, there can exist enough variability in how the events are fitted that a substantial portion of them is rejected as false negatives under this scheme, as evidenced by the statistics of Table 1 and rejection modes of Figure 5, above. While Nanolyzer settings can be tuned to optimize fitting, a certain amount of variability will always exist and will usually be correlated with particular subpopulations of events (e.g. those with subevents that are short compared to the rise time of the system are routinely underfitted).
As discussed, much of this variability in event substructure can be linked to the fitting of subevents from the cargo signals. A more versatile approach then would involve “relaxed” event filters based only on the (well-fitted) sublevels of the carriers as well as individual points of “large” current deviation that could be associated with the presence of the bulky attachments, regardless of how they were fitted.
Selection Filters Based on Local Max Deviation
An example of a simple set of two such selection filters that follow these principles is presented in Table 2, where each selection filter is designed to capture events in a separate regime of cargo subevent fitting. The first (“Filter A”) looks for Type 2 events (“free end first”, see Figure 2) where the cargo subevent has been missed by sublevel fitting, and so the last fitted sublevel of the event is instead that of an unfolded carrier (see Figures 5a and 7a for representative examples). This condition alone (#1 in Table 2) could be satisfied by an event from any of the unfolded or partially folded event categories (Types U and P), regardless of which end (if any) the cargo was attached to, and so it is supplemented with a second condition (#2) that checks if the event is also characterized by a large terminal spike in current, as would only be the case with Type 2 events. The test for this spike is if the maximum deviation within only the final baseline sublevel (“local max deviation”) exceeds a set threshold value (see Fig. 7a). Some strategies for choosing this threshold value are presented in the discussion surrounding the specific dataset of Figure 7, below.
Table 2:
Example set of “relaxed” selection filters that select for Type 2 events using only the sequence of fitted sublevels from the carrier and the “local max deviation” of individual sublevels. All numbered conditions must be satisfied simultaneously for an event to be classified under a given filter.
| Event Filter | Conditions | Target |
|---|---|---|
| Filter A | 1. event ends in unfolded carrier sublevel; 2. max deviation of last (baseline) sublevel > threshold value | Type 2 events without fitted cargo sublevel(s) |
| Filter B | 1. event ends not in unfolded carrier sublevel; 2. event contains unfolded carrier sublevel (elsewhere); 3. max deviation of last (baseline) sublevel > threshold value or max deviation of 2nd last sublevel > threshold value | Type 2 events with fitted cargo sublevel(s) |
A second filter (“Filter B”) looks for the remainder of Type 2 events – those with at least one fitted cargo sublevel. It starts with the condition (#1) that the event ends on a sublevel other than that of an unfolded carrier. Note that this makes Filters A and B mutually exclusive from their respective first conditions alone – they will extract out different sets of events that can later be combined without any repeats. Filter B also contains a condition (#3) that checks for local max deviations above a threshold value (analogous to Condition #2 of Filter A), but this time the threshold may be crossed in one or both of the last two sublevels, to account for Filter B events having at least one extra sublevel before the baseline through Condition #1 (see Figure 7b). Finally, an extra condition (#2) is added to Filter B that its events contain at least one unfolded carrier level somewhere (though not at their very ends, as this is precluded by Condition #1). This is aimed at removing any fully folded (Type F) events as well as any true negatives that are not recognizable as simple carrier-cargo events, as in the example of Figure 5h.
Together then, Filters A and B are capable of extracting Type 2 (“free end first”) carrier-cargo events with relative indifference to how well their cargo subevent signals were fitted to sublevels. By focusing only on the end of these events, they also lead to fewer false negatives arising from extra/alternative sublevels being fitted to earlier carrier subevents, as in the examples of Figures 5d–f. Note that this example set of selection filters ignores Type 1 events, since these may be problematic to our application of sorting cargo-carrier types due to the deeper blockages of Type 1P events (see Figure 4d) and the potential for confusion between the blockages of small cargo attachments and those of folded free carriers (Type 1U vs. Type 0P events). However, very similar filters that select for Type 1 events could be constructed by focusing instead on the sublevels at the start of an event (fitted carrier sublevels and sublevel max deviations), should the need arise.
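The logic of Table 2 can be sketched as two predicates over an ordered list of fitted sublevels. This is one possible reading, not the implementation used here: it assumes that each event record ends with the final (baseline) sublevel named in Table 2, that "ends in" refers to the sublevel immediately before it, and that each sublevel carries a label and a local max deviation. The class, field names, labels, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FittedSublevel:
    label: str               # e.g. "unfolded", "folded", "cargo", "baseline"
    local_max_dev_pA: float  # max deviation from baseline within the sublevel

def ends_in_unfolded(event: List[FittedSublevel]) -> bool:
    """True if the sublevel just before the final baseline is unfolded carrier."""
    return len(event) >= 2 and event[-2].label == "unfolded"

def filter_a(event: List[FittedSublevel], threshold_pA: float) -> bool:
    # Condition 1: ends in unfolded carrier; Condition 2: terminal spike.
    return ends_in_unfolded(event) and event[-1].local_max_dev_pA > threshold_pA

def filter_b(event: List[FittedSublevel], threshold_pA: float) -> bool:
    # Condition 1: does not end in unfolded carrier.
    if len(event) < 2 or ends_in_unfolded(event):
        return False
    # Condition 2: an unfolded carrier sublevel appears elsewhere in the event.
    has_unfolded = any(s.label == "unfolded" for s in event[:-1])
    # Condition 3: spike in the last or second-last sublevel.
    spike = (event[-1].local_max_dev_pA > threshold_pA
             or event[-2].local_max_dev_pA > threshold_pA)
    return has_unfolded and spike

# Illustrative events (values are made up).
unfitted_cargo = [FittedSublevel("unfolded", 1100.0), FittedSublevel("baseline", 2500.0)]
fitted_cargo = [FittedSublevel("unfolded", 1100.0), FittedSublevel("cargo", 4000.0),
                FittedSublevel("baseline", 300.0)]
bare_carrier = [FittedSublevel("unfolded", 1100.0), FittedSublevel("baseline", 200.0)]
fully_folded = [FittedSublevel("folded", 2200.0), FittedSublevel("cargo", 4500.0),
                FittedSublevel("baseline", 100.0)]
```

Because the two predicates branch on the same first condition, no event can satisfy both, so their outputs can be concatenated without duplicates.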
Figure 7 shows the result of applying the selection filters of Table 2 to some carrier-cargo translocation events. The dataset chosen is the same as in Figures 6e and 6f and was generated from a sample of 6- and 12-arm stars attached to 2-kbp carriers and passed through a relatively small, ~8-nm pore (3.6 M LiCl, 200 mV). As discussed above, this combination of pore size and attachment types resulted in distinct regimes of cargo subevent fitting being represented: fast, poorly-fitted subevents (from “small” 6-arm stars) and slow, multi-level subevents (from “large” 12-arm stars) – see also Supporting Information Section S10 for an application of these filters with a larger, ~12-nm pore where both attachments fall in the “fast” regime. For the local max deviation threshold required by the filters, a value was chosen (1500 pA) that was intermediate between the average values of the unfolded (1040 pA) and folded (2142 pA) carrier levels. This represents a reasonable balance between the need to clear noise above an unfolded level while still capturing attenuated spikes from fast, folded subevents; an alternative criterion might be to use a fixed number of standard deviations above the unfolded carrier level (~4σ above 1040 pA here with the use of a 1500 pA threshold and a 500 kHz low-pass Bessel filter).
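The two threshold criteria mentioned above can be related with a short calculation. The helper name is hypothetical, and the noise value is only what is implied by reading the quoted "~4σ" literally; it is not a separately reported measurement.

```python
def sigma_threshold(level_pA, noise_sigma_pA, n_sigma):
    """Threshold placed n_sigma noise standard deviations above a fitted level."""
    return level_pA + n_sigma * noise_sigma_pA

UNFOLDED_PA = 1040.0   # average unfolded carrier level (from the text)
FOLDED_PA = 2142.0     # average folded carrier level (from the text)
THRESHOLD_PA = 1500.0  # local max deviation threshold used for Figure 7

# Noise level implied if 1500 pA sits exactly 4 sigma above the unfolded level.
implied_sigma_pA = (THRESHOLD_PA - UNFOLDED_PA) / 4.0

# Position of the threshold between the unfolded and folded carrier levels.
fraction_of_gap = (THRESHOLD_PA - UNFOLDED_PA) / (FOLDED_PA - UNFOLDED_PA)
```

With these numbers the threshold sits a little below the midpoint of the two carrier levels, corresponding to an implied noise of roughly 115 pA at the stated bandwidth.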
The output of Filter A on this dataset is presented in Figure 7c as a 2D-histogram of max deviation vs. log[ECD]. Most events captured here are from molecules with the smaller 6-arm star attachments, specifically those without a fitted cargo sublevel (see Figure 7a for the current trace and fitted sublevels of a typical event). In contrast, applying Filter B to the same dataset mainly extracts events from molecules with 12-arm star attachments, which feature wide spreads in dwell time/ECD due to interactions between this larger structure and the pore walls (top right population in Figure 7d, sample event in Figure 7b). Also collected by Filter B is a smaller subset with shallower blockages and shorter dwell times (middle population in Figure 7d) that represents some of the deepest of the 6-arm star events where a cargo sublevel was successfully fitted.
Figure 7e shows the union of the outputs from Filters A and B plotted on max deviation vs. log[ECD] while Figure 7f plots the same data on max deviation vs. log[TrECD], as defined earlier. This last plot illustrates the success of these filters – they result in two distinct populations (corresponding to the two attachment types) that are separable in max deviation and characterized by a single peak in TrECD, reflecting their shared 2-kbp carrier lengths. Note that part of the reason for this degree of separability came from targeting only Type 2 events, which had the effect of distinguishing the shallowest of the 6-arm star events from those of folded free carriers (no Type 1U events to be confused with Type 0P) as well as distinguishing the shallowest of the 12-arm star events from the deepest of the 6-arm star events (Type 1P, removed) – compare Figure 7f with Figure 6f, above.
In total, 1976 Type 2 events were extracted from the data of Figure 7 using the selection filters of Table 2 – 795 events with Filter A and 1181 events with Filter B – and importantly, these events spanned a range of cargo subevent fitting behaviors (0, 1, or 2+ fitted sublevels) and cargo subevent blockage depths resulting from the two different-sized star cargos. In comparison, when the rigid sublevel sequence approach of earlier sections was applied to the same data (see Supporting Information Section S11) only 540 total events, including 217 Type 2 events (an 89% reduction), were recovered, due primarily to the requirement that exactly one cargo sublevel be present per event and that this cargo sublevel be substantially deeper than that of a folded carrier in order to be labelled as such. Moreover, when examining how these 540 events are distributed over max deviation, a strong bias selecting specifically for Type 1P events from molecules with 6-arm star attachments becomes apparent (Figure S8d). As was postulated in the discussion of Figure 5, this likely corresponds to the type of event that is most easily fitted with a single cargo sublevel under these experimental conditions (e.g. deeper than other 6-arm star events due to the extra carrier fragment but not as prone to being fitted to multiple sublevels as 12-arm star events).
In summary, an approach to extracting useful events that is based on minimum sets of shared characteristics (e.g. carrier sublevels, threshold crossings) of their current signals rather than exact sublevel sequences can provide a dramatic improvement to the data curation and event segmentation process by being applicable to molecules in different size regimes simultaneously and by introducing fewer selection biases on the output (arising from bandwidth limitations, event fitting issues, or otherwise).
CONCLUSION
In this work, we presented the analysis of nanopore data generated by a basic structured polymer: an extended length of linear, double-stranded DNA (carrier) attached at one end to a DNA nanostructure (cargo). First, we organized the translocation events of a dataset from a single carrier-cargo pair by the folding state of their parent molecules to assess the contribution of molecular conformation to the distribution of several key metadata such as event duration and maximum (current) deviation. In the process, it was also observed how inconsistency in the fitting of discrete levels to the current signals can lead to many events being rejected by a categorization scheme that relies on recognizing exact sequences of these fitted sublevels. With this knowledge in hand, refinements of existing analysis tools/metrics were developed (e.g. truncated ECD, local max deviation) which, when combined with basic filtering and applied to datasets from mixes of molecular species (carrier length and cargo type), led to improved separation of different populations, lower rates of false negatives, and selection filters suited for data from a wider range of experimental conditions.
Over the course of analyzing data from this model system, a general approach emerged that should be applicable to most nanopore experiments on structured polymers: focusing on the most reduced subset of features from the full event substructure that, in a given application, groups populations of like molecular signals together. This allows for more events to be retained that might otherwise be rejected (e.g. folded events) on the basis of variation that emerges from different molecular conformations or different levels of fitting success. In the future, this approach could be further improved by incorporating a peak-finding or threshold-crossing algorithm to locate the short current spikes in the signal and therefore completely decouple the information found in them from fitted levels. Each event type could then be identified by the ordered list of carrier sublevels and local maxima/threshold crossings, which should provide rich information on which to base more universal event filters.
Another avenue for future investigation is in characterizing the efficiency of particular selection filters by quantifying the numbers of false negatives removed and the numbers of false positives that remain in the subsets they create. This would require the development of further metrics to distinguish accurately- from inaccurately-labelled events in both cases, something that may be challenging depending on how complex these distinctions are in a particular application. This is an area where approaches that incorporate a machine learning model may prove useful by only having to manually identify a limited number of representative examples from each set (or in simulated data with known ground truth) as training data; from there, the remainder of the false positives/negatives could be identified by progressively more accurate automated classification. Regardless of the method used, reducing the numbers of mislabelled signals will remain an important consideration for minimizing statistical bias as nanopore sensing of complex analytes continues to be established as a powerful bioanalysis technique into the future.
Going forward, the somewhat laborious event characterization process presented here (of clustering sublevels and defining selection filters) would ideally not be something to be performed after each nanopore experiment. Automation of this process will be a key component of incorporating nanopore-based molecular assays into real-world diagnostic technologies. To that end, we hope that some of the tools and approaches presented here will also find utility in the development of more complex AI/ML event classification models, by providing unbiased training data curated through these techniques. This way, event classification will only need to be performed once when a new experimental system is explored, and subsequent analysis can happen quickly based on the resulting trained model.
METHODS
DNA Sample Preparation
DNA carriers used in this study were synthesized by one of two methods. In the first method, a polymerase chain reaction (Q5 DNA Polymerase, New England Biolabs) was carried out on an M13mp18 template (NEB) to create a double-stranded fragment of variable length using a phosphorylated forward primer and a phosphorothioated reverse primer (Integrated DNA Technologies). The phosphorylated strand was then selectively digested with λ-exonuclease (NEB) to leave a single-stranded scaffold. Finally, short (~40 nt) complementary oligonucleotides (IDT) were hybridized to this scaffold, creating a dsDNA fragment with many regularly-spaced nicks on one strand47 (subsequently repaired with T4 DNA ligase, NEB). The terminal oligonucleotides were chosen to be longer than the ends of the scaffold by 12 bases, resulting in single-stranded overhangs through which cargo molecules could be attached via a hybridization reaction.
In the second method of carrier synthesis, a variable length PCR product was again created, this time using lambda DNA (NEB) as a template. With this method, the PCR product is then digested with a restriction enzyme (e.g. PciI, NEB) to create short (e.g. 4 nt) sticky ends which are used to circularize the fragment (T4 DNA ligase, NEB). Finally, a unique sequence (cos) of lambda DNA48 present in the circularized fragments is used as the target of a λ-terminase enzyme (Catalano Lab, University of Colorado) which acts to create staggered nicks in the cos sequence,48 opening the DNA circles into linear fragments with 12-nt single-stranded overhangs. Carrier molecules generated by each of the two methods (oligo assembly and λ-terminase digestion) were found to perform comparably and were used interchangeably throughout the study.
For cargo molecules, DNA star junctions were synthesized as previously described.14,27 Briefly, short (48-nt) DNA oligonucleotides, each spanning two of the star junction arms, were mixed in equimolar ratios, heated to 95 °C and allowed to cool slowly back to room temperature. In this way, each strand is hybridized to its two neighbours within the completed structure to form a multi-arm junction (see Fig 1a). For this application, one of the oligonucleotides was extended by 12 bases to create a single-stranded overhang that is complementary to the analogous overhang on the carriers. To purify the 6- and 12-arm junctions used here, crude DNA star samples were imaged by polyacrylamide gel electrophoresis, the main gel bands were excised, and the contained nanostructures re-suspended by electroelution using dialysis tubes (D-Tube Dialyzer Maxi, EMD Millipore).
Nanopore Fabrication & Sensing
The nanopores used in this study were fabricated by controlled dielectric breakdown, as described previously.49,50 In short, thin silicon nitride membranes (40 × 40 μm², 12 nm thick) were purchased from Norcada Inc. as TEM windows (#NBPX5004Z-HR) and mounted in custom 3D-printed flow cells, sandwiching the membrane between two fluidic reservoirs filled with buffered electrolyte solution (1 M KCl, 10 mM HEPES, pH 8). A high transmembrane potential (~10 V) is then applied via Ag/AgCl electrodes connected to a custom fabrication circuit that also monitors the current across the membrane. As soon as a large jump in current is detected, signalling the perforation of the membrane (breakdown) and the onset of ionic conduction, the voltage is quickly switched off, resulting in a single nanoscale hole across the membrane.49,50
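The stopping condition described above can be illustrated with a simple threshold test on the monitored current. This is only a sketch under stated assumptions: the actual trigger criterion of the fabrication circuit is not given here, so the baseline window, the `jump_factor` multiplier, and the function name are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def find_breakdown_index(
    current_trace: Sequence[float],
    jump_factor: float = 5.0,      # hypothetical: how many times the leakage level counts as a jump
    baseline_samples: int = 10,    # hypothetical: samples used to estimate the leakage current
) -> Optional[int]:
    """Return the index of the first sample whose magnitude exceeds
    jump_factor times the mean leakage current, or None if no jump occurs."""
    if len(current_trace) <= baseline_samples:
        raise ValueError("trace is too short to estimate a baseline")
    # Mean absolute leakage current before breakdown.
    baseline = sum(abs(x) for x in current_trace[:baseline_samples]) / baseline_samples
    threshold = jump_factor * baseline
    for i in range(baseline_samples, len(current_trace)):
        if abs(current_trace[i]) > threshold:
            return i  # in a real system, the voltage would be switched off at this point
    return None
```

In practice the response time of the cut-off matters at least as much as the threshold itself, since the pore continues to enlarge for as long as the high voltage remains applied.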
The size of each nanopore is estimated by first taking a series of current measurements at lower applied transmembrane voltages (≤ 1 V) and fitting the Ohmic I-V response to a straight line in order to extract a pore conductance (G). The pore conductance is then converted into an estimate of pore diameter (d) using the following conductance model3:
$$G = \sigma \left( \frac{4L}{\pi d^{2}} + \frac{1}{d} \right)^{-1}$$
where σ is the measured conductivity of the electrolyte solution and L is the nanopore length, estimated in this work as the nominal, manufacturer-supplied value of the membrane thickness.
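The pore-sizing procedure can be sketched as follows, assuming the conductance model of ref 3 takes its commonly used form G = σ[4L/(πd²) + 1/d]⁻¹ (a cylindrical pore resistance in series with an access resistance). Rearranging gives a quadratic in d whose positive root is d = (G/2σ)[1 + √(1 + 16σL/(πG))]. The function names and the example conductivity value are illustrative assumptions, not values from this work.

```python
import math
from typing import Sequence


def conductance_from_iv(voltages: Sequence[float], currents: Sequence[float]) -> float:
    """Least-squares slope of current (A) versus voltage (V), i.e. conductance in S."""
    n = len(voltages)
    if n < 2 or n != len(currents):
        raise ValueError("need at least two matching (V, I) points")
    v_mean = sum(voltages) / n
    i_mean = sum(currents) / n
    num = sum((v - v_mean) * (i - i_mean) for v, i in zip(voltages, currents))
    den = sum((v - v_mean) ** 2 for v in voltages)
    return num / den


def model_conductance(d: float, sigma: float, length: float) -> float:
    """Assumed model: G = sigma * [4L/(pi d^2) + 1/d]^-1 (SI units)."""
    return sigma / (4.0 * length / (math.pi * d ** 2) + 1.0 / d)


def pore_diameter(G: float, sigma: float, length: float) -> float:
    """Invert the assumed model for d (positive root of the quadratic in d)."""
    return (G / (2.0 * sigma)) * (
        1.0 + math.sqrt(1.0 + 16.0 * sigma * length / (math.pi * G))
    )


if __name__ == "__main__":
    sigma = 10.0      # S/m, illustrative conductivity only
    length = 12e-9    # m, nominal membrane thickness quoted in the text
    volts = [-0.2, -0.1, 0.0, 0.1, 0.2]
    amps = [model_conductance(10e-9, sigma, length) * v for v in volts]
    G = conductance_from_iv(volts, amps)
    print(f"d = {pore_diameter(G, sigma, length) * 1e9:.2f} nm")
```

Because L is taken as the nominal membrane thickness, any thinning of the membrane around the pore would make the true effective length shorter and the diameter estimate correspondingly approximate.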
Prior to a sensing experiment, both flow cell reservoirs were flushed with fresh sensing electrolyte (3.6 M LiCl, 10 mM HEPES, pH 8). A negatively-charged DNA sample (~1 nM per target and equilibrated to the same ionic strength/species as the sensing electrolyte) was then added to one reservoir and electrophoretically driven to the nanopore by application of a small transmembrane potential (0–200 mV). The resulting current signals from translocating molecules were recorded by a high-bandwidth amplifier (VC100, Chimera Instruments) at 1 MHz instrument bandwidth (4.17 MHz sampling rate).
Data Analysis
As detailed in the main text, raw current data from an experiment is first passed to the Nanolyzer software package (Northern Nanopore Instruments) for event detection and sublevel fitting to the internal event structure. Supporting Information Section S1 contains an example of the Nanolyzer settings used to configure this analysis. Event sublevels are then grouped by the values of metadata generated in the prior step (“sublevel blockage” and “sublevel duration”, see Supporting Information Section S2) using Nanolyzer’s Clustering tool. This associates a new numerical value (“sublevel label”) to each sublevel in the dataset. Finally, event selection filters based on these sublevel labels (in addition to other metadata, see S2) are entered as SQL queries into Nanolyzer’s Data Manager tool and used to create separately addressable subsets of events. A full list of selection filters used is presented in Supporting Information Section S3. In addition to the metadata generated by Nanolyzer, two others (“truncated ECD” and “local max deviation”, see main text for definitions) were calculated with custom scripts written in Python using the ‘pandas’ library. These scripts were applied to .csv files exported from Nanolyzer containing lists of all events/sublevels in each subset as well as the values of their associated metadata.
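To illustrate the kind of post-processing that can be applied to an exported sublevel table, the sketch below uses pandas to rebuild the ordered sequence of sublevel labels for each event and to select events matching a given pattern. It is not the authors' script: the column names (`event_id`, `sublevel_index`, `sublevel_label`), the label values, and the example pattern are all hypothetical, and the real export format may differ.

```python
import pandas as pd


def label_sequences(sublevels: pd.DataFrame) -> pd.Series:
    """Return, for each event, the tuple of sublevel labels in temporal order."""
    ordered = sublevels.sort_values(["event_id", "sublevel_index"])
    return ordered.groupby("event_id")["sublevel_label"].apply(tuple)


def select_events(sublevels: pd.DataFrame, pattern: tuple) -> list:
    """Return the IDs of events whose label sequence equals `pattern` exactly."""
    sequences = label_sequences(sublevels)
    mask = sequences.map(lambda seq: seq == tuple(pattern))
    return sequences[mask].index.tolist()


# Hypothetical example table: one row per fitted sublevel.
example = pd.DataFrame(
    {
        "event_id":       [1, 1, 1, 2, 2, 3, 3, 3],
        "sublevel_index": [0, 1, 2, 0, 1, 2, 0, 1],
        "sublevel_label": [0, 1, 0, 0, 0, 0, 0, 1],
    }
)

# Events whose sublevels follow the (hypothetical) pattern label 0, then 1, then 0.
matching = select_events(example, (0, 1, 0))
```

With a real export, the table would typically be loaded with `pd.read_csv` before calling `select_events`, and additional metadata columns could be combined with the label pattern in the same way that the SQL selection filters combine conditions.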
ACKNOWLEDGEMENTS
The authors would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), through funding from grant #CRDPJ 530554-18. The authors would like to thank Drs. Carlos Catalano and Qin Yang of the University of Colorado for generously providing the λ terminase enzyme used in this work.
Footnotes
Supporting Information
Configuration details of Nanolyzer analysis; list of metadata used in Nanolyzer Data Manager; selection filters used – list of SQL queries; removing Type 0 events before sublevel clustering; first ten events from each category of Table 1 / Figure 4; comparing ECD distributions of Type 0 vs. Type 1/2 events; 1D histograms of ECD/TrECD for events of Figure 6; control – two carrier + star pairs run separately and together on a single pore; sample events of 2 kbp + 6/12-arm stars through smaller pore; Filters A & B applied to 2 kbp + 6/12-arm stars through larger pore; rigid sorting approach on events from 2 kbp + 6/12-arm stars in Figure 7 (PDF)
CONFLICT OF INTEREST
K.B. and V.T.-C. are, respectively, the CEO and CSO of Northern Nanopore Instruments Inc., a for-profit company that provides solid-state nanopore tools and software, including the Nanolyzer software used here. Z.R. declares no conflicts of interest.
REFERENCES
Note that an earlier draft version of this article is available as a preprint51.
- (1). Xue L; Yamazaki H; Ren R; Wanunu M; Ivanov AP; Edel JB Solid-State Nanopore Sensors. Nat Rev Mater 2020, 5 (12), 931–951. 10.1038/s41578-020-0229-6.
- (2). Tabard-Cossa V Instrumentation for Low-Noise High-Bandwidth Nanopore Recording. In Engineered Nanopores for Bioanalytical Applications; Edel JB, Albrecht T, Eds.; Norwich, N.Y.: Andrew William; Oxford: Elsevier Science distributor, 2013; pp 59–93. 10.1016/B978-1-4377-3473-7.00003-0.
- (3). Kowalczyk SW; Grosberg AY; Rabin Y; Dekker C Modeling the Conductance and DNA Blockade of Solid-State Nanopores. Nanotechnology 2011, 22 (31), 315101. 10.1088/0957-4484/22/31/315101.
- (4). Ying YL; Hu ZL; Zhang S; Qing Y; Fragasso A; Maglia G; Meller A; Bayley H; Dekker C; Long YT Nanopore-Based Technologies beyond DNA Sequencing. Nat Nanotechnol 2022, 17 (11), 1136–1146. 10.1038/s41565-022-01193-2.
- (5). Forstater JH; Briggs K; Robertson JWF; Ettedgui J; Marie-Rose O; Vaz C; Kasianowicz JJ; Tabard-Cossa V; Balijepalli A MOSAIC: A Modular Single-Molecule Analysis Interface for Decoding Multistate Nanopore Data. Anal Chem 2016, 88 (23), 11900–11907. 10.1021/acs.analchem.6b03725.
- (6). Raillon C; Granjon P; Graf M; Steinbock LJ; Radenovic A Fast and Automatic Processing of Multi-Level Events in Nanopore Translocation Experiments. Nanoscale 2012, 4 (16), 4916. 10.1039/c2nr30951c.
- (7). Plesa C; Dekker C Data Analysis Methods for Solid-State Nanopores. Nanotechnology 2015, 26 (8), 084003. 10.1088/0957-4484/26/8/084003.
- (8). Sun Z; Liu X; Liu W; Li J; Yang J; Qiao F; Ma J; Sha J; Li J; Xu L-Q AutoNanopore: An Automated Adaptive and Robust Method to Locate Translocation Events in Solid-State Nanopore Current Traces. ACS Omega 2022, 7 (42), 37103–37111. 10.1021/acsomega.2c02927.
- (9). Carson S; Wilson J; Aksimentiev A; Wanunu M Smooth DNA Transport through a Narrowed Pore Geometry. Biophys J 2014, 107 (10), 2381–2393. 10.1016/j.bpj.2014.10.017.
- (10). Briggs K; Madejski G; Magill M; Kastritis K; de Haan HW; McGrath JL; Tabard-Cossa V DNA Translocations through Nanopores under Nanoscale Preconfinement. Nano Lett 2018, 18 (2), 660–668. 10.1021/acs.nanolett.7b03987.
- (11). Fologea D; Brandin E; Uplinger J; Branton D; Li J DNA Conformation and Base Number Simultaneously Determined in a Nanopore. Electrophoresis 2007, 28 (18), 3186–3192. 10.1002/elps.200700047.
- (12). Storm AJ; Chen JH; Zandbergen HW; Dekker C Translocation of Double-Strand DNA through a Silicon Oxide Nanopore. Phys Rev E 2005, 71 (5), 051903. 10.1103/PhysRevE.71.051903.
- (13). Bell NAW; Muthukumar M; Keyser UF Translocation Frequency of Double-Stranded DNA through a Solid-State Nanopore. Phys Rev E 2016, 93 (2), 022401. 10.1103/PhysRevE.93.022401.
- (14). He L; Karau P; Tabard-Cossa V Fast Capture and Multiplexed Detection of Short Multi-Arm DNA Stars in Solid-State Nanopores. Nanoscale 2019, 11 (35), 16342–16350. 10.1039/C9NR04566J.
- (15). Smeets RMM; Keyser UF; Krapf D; Wu M-Y; Dekker NH; Dekker C Salt Dependence of Ion Transport and DNA Translocation through Solid-State Nanopores. Nano Lett 2006, 6 (1), 89–95. 10.1021/nl052107w.
- (16). Verschueren DV; Jonsson MP; Dekker C Temperature Dependence of DNA Translocations through Solid-State Nanopores. Nanotechnology 2015, 26 (23), 234004. 10.1088/0957-4484/26/23/234004.
- (17). Cao C; Krapp LF; Al Ouahabi A; König NF; Cirauqui N; Radenovic A; Lutz J-F; Peraro MD Aerolysin Nanopores Decode Digital Information Stored in Tailored Macromolecular Analytes. Sci Adv 2020, 6 (50), 1–9. 10.1126/sciadv.abc2661.
- (18). Boukhet M; König NF; Al Ouahabi A; Baaken G; Lutz J-F; Behrends JC Translocation of Precision Polymers through Biological Nanopores. Macromol Rapid Commun 2017, 38 (24), 1700680. 10.1002/marc.201700680.
- (19). Chen K; Zhu J; Bošković F; Keyser UF Nanopore-Based DNA Hard Drives for Rewritable and Secure Data Storage. Nano Lett 2020, 20 (5), 3754–3760. 10.1021/acs.nanolett.0c00755.
- (20). Zhu J; Ermann N; Chen K; Keyser UF Image Encoding Using Multi-Level DNA Barcodes with Nanopore Readout. Small 2021, 17 (28), 2100711. 10.1002/smll.202100711.
- (21). Chen K; Kong J; Zhu J; Ermann N; Predki P; Keyser UF Digital Data Storage Using DNA Nanostructures and Solid-State Nanopores. Nano Lett 2019, 19 (2), 1210–1215. 10.1021/acs.nanolett.8b04715.
- (22). Kong J; Bell NAW; Keyser UF Quantifying Nanomolar Protein Concentrations Using Designed DNA Carriers and Solid-State Nanopores. Nano Lett 2016, 16 (6), 3557–3562. 10.1021/acs.nanolett.6b00627.
- (23). Bell NAW; Keyser UF Digitally Encoded DNA Nanostructures for Multiplexed, Single-Molecule Protein Sensing with Nanopores. Nat Nanotechnol 2016, 11 (7), 645–651. 10.1038/nnano.2016.50.
- (24). Kong J; Zhu J; Chen K; Keyser UF Specific Biosensing Using DNA Aptamers and Nanopores. Adv Funct Mater 2019, 29 (3), 1807555. 10.1002/adfm.201807555.
- (25). Sze JYY; Ivanov AP; Cass AEG; Edel JB Single Molecule Multiplexed Nanopore Protein Screening in Human Serum Using Aptamer Modified DNA Carriers. Nat Commun 2017, 8 (1), 1552. 10.1038/s41467-017-01584-3.
- (26). Cai S; Sze JYY; Ivanov AP; Edel JB Small Molecule Electro-Optical Binding Assay Using Nanopores. Nat Commun 2019, 10 (1), 1797. 10.1038/s41467-019-09476-4.
- (27). He L; Tessier DR; Briggs K; Tsangaris M; Charron M; McConnell EM; Lomovtsev D; Tabard-Cossa V Digital Immunoassay for Biomarker Concentration Quantification Using Solid-State Nanopores. Nat Commun 2021, 12 (1), 5348. 10.1038/s41467-021-25566-8.
- (28). Lam MH; Briggs K; Kastritis K; Magill M; Madejski GR; McGrath JL; de Haan HW; Tabard-Cossa V Entropic Trapping of DNA with a Nanofiltered Nanopore. ACS Appl Nano Mater 2019, 2 (8), 4773–4781. 10.1021/acsanm.9b00606.
- (29). Briggs K Solid-State Nanopores: Fabrication, Application, and Analysis. PhD Thesis, University of Ottawa, 2018. 10.20381/ruor-22794.
- (30). Chien C-C; Shekar S; Niedzwiecki DJ; Shepard KL; Drndić M Single-Stranded DNA Translocation Recordings through Solid-State Nanopores on Glass Chips at 10 MHz Measurement Bandwidth. ACS Nano 2019, 13 (9), 10545–10554. 10.1021/acsnano.9b04626.
- (31). Misiunas K; Ermann N; Keyser UF QuipuNet: Convolutional Neural Network for Single-Molecule Nanopore Sensing. Nano Lett 2018, 18 (6), 4040–4045. 10.1021/acs.nanolett.8b01709.
- (32). Arima A; Tsutsui M; Washio T; Baba Y; Kawai T Solid-State Nanopore Platform Integrated with Machine Learning for Digital Diagnosis of Virus Infection. Anal Chem 2021, 93 (1), 215–227. 10.1021/acs.analchem.0c04353.
- (33). Taniguchi M; Minami S; Ono C; Hamajima R; Morimura A; Hamaguchi S; Akeda Y; Kanai Y; Kobayashi T; Kamitani W; Terada Y; Suzuki K; Hatori N; Yamagishi Y; Washizu N; Takei H; Sakamoto O; Naono N; Tatematsu K; Washio T; Matsuura Y; Tomono K Combining Machine Learning and Nanopore Construction Creates an Artificial Intelligence Nanopore for Coronavirus Detection. Nat Commun 2021, 12 (1), 3726. 10.1038/s41467-021-24001-2.
- (34). Xia K; Hagan JT; Fu L; Sheetz BS; Bhattacharya S; Zhang F; Dwyer JR; Linhardt RJ Synthetic Heparan Sulfate Standards and Machine Learning Facilitate the Development of Solid-State Nanopore Analysis. Proceedings of the National Academy of Sciences 2021, 118 (11), 1–7. 10.1073/pnas.2022806118.
- (35). Tsutsui M; Takaai T; Yokota K; Kawai T; Washio T Deep Learning-Enhanced Nanopore Sensing of Single-Nanoparticle Translocation Dynamics. Small Methods 2021, 5 (7), 2100191. 10.1002/smtd.202100191.
- (36). Zhang DY; Winfree E Control of DNA Strand Displacement Kinetics Using Toehold Exchange. J Am Chem Soc 2009, 131 (47), 17303–17314. 10.1021/ja906987s.
- (37). Zhu J; Zhang L; Zhou Z; Dong S; Wang E Molecular Aptamer Beacon Tuned DNA Strand Displacement to Transform Small Molecules into DNA Logic Outputs. Chemical Communications 2014, 50 (25), 3321. 10.1039/c3cc49833f.
- (38). Kong J; Zhu J; Keyser UF Single Molecule Based SNP Detection Using Designed DNA Carriers and Solid-State Nanopores. Chemical Communications 2017, 53 (2), 436–439. 10.1039/C6CC08621G.
- (39). Beamish E; Tabard-Cossa V; Godin M Identifying Structure in Short DNA Scaffolds Using Solid-State Nanopores. ACS Sens 2017, 2 (12), 1814–1820. 10.1021/acssensors.7b00628.
- (40). King S; Briggs K; Slinger R; Tabard-Cossa V Screening for Group A Streptococcal Disease via Solid-State Nanopore Detection of PCR Amplicons. ACS Sens 2022, 7 (1), 207–214. 10.1021/acssensors.1c01972.
- (41). Chen P; Gu J; Brandin E; Kim Y-R; Wang Q; Branton D Probing Single DNA Molecule Transport Using Fabricated Nanopores. Nano Lett 2004, 4 (11), 2293–2298. 10.1021/nl048654j.
- (42). Mihovilovic M; Hagerty N; Stein D Statistics of DNA Capture by a Solid-State Nanopore. Phys Rev Lett 2013, 110 (2), 028102. 10.1103/PhysRevLett.110.028102.
- (43). Kowalczyk SW; Wells DB; Aksimentiev A; Dekker C Slowing down DNA Translocation through a Nanopore in Lithium Chloride. Nano Lett 2012, 12 (2), 1038–1044. 10.1021/nl204273h.
- (44). Fologea D; Gershow M; Ledden B; McNabb DS; Golovchenko JA; Li J Detecting Single Stranded DNA with a Solid State Nanopore. Nano Lett 2005, 5 (10), 1905–1909. 10.1021/nl051199m.
- (45). Karau P; Tabard-Cossa V Capture and Translocation Characteristics of Short Branched DNA Labels in Solid-State Nanopores. ACS Sens 2018, 3 (7), 1308–1315. 10.1021/acssensors.8b00165.
- (46). Carlsen A; Tabard-Cossa V Mapping Shifts in Nanopore Signal to Changes in Protein and Protein-DNA Conformation. Proteomics 2022, 22 (5–6), 2100068. 10.1002/pmic.202100068.
- (47). Bell NAW; Keyser UF Specific Protein Detection Using Designed DNA Carriers and Nanopores. J Am Chem Soc 2015, 137 (5), 2035–2041. 10.1021/ja512521w.
- (48). Catalano CE The Terminase Enzyme from Bacteriophage Lambda: A DNA-Packaging Machine. Cell Mol Life Sci 2000, 57 (1), 128–148. 10.1007/s000180050503.
- (49). Kwok H; Briggs K; Tabard-Cossa V Nanopore Fabrication by Controlled Dielectric Breakdown. PLoS One 2014, 9 (3), e92880. 10.1371/journal.pone.0092880.
- (50). Briggs K; Kwok H; Tabard-Cossa V Automated Fabrication of 2-Nm Solid-State Nanopores for Nucleic Acid Analysis. Small 2014, 10 (10), 2077–2086. 10.1002/smll.201303602.
- (51). Roelen Z; Briggs K; Tabard-Cossa V Analysis of Nanopore Data: Classification Strategies for an Unbiased Curation of Single-Molecule Events from DNA Nanostructures. ChemRxiv 2023. 10.26434/chemrxiv-2023-09wgb.
