Nature Communications. 2025 Nov 17;16:10059. doi: 10.1038/s41467-025-65004-7

Approaching single-molecule assembly-free readout from medium-length encoded DNA

Weigang Chen 1,2,3,✉,#, Rui Qin 1,#, Quan Guo 1,#, Jian Guo 1,#, Qi Ge 1, Yingjin Yuan 2,3,
PMCID: PMC12623992  PMID: 41249129

Abstract

For DNA data storage, nanopore sequencing can facilitate rapid readout but suffers from severe insertion/deletion errors, which are quite computationally expensive to correct. Here, we propose a nearly single-molecule and assembly-free readout scheme for medium-length pseudo-noise piloting DNA fragments. Specifically, we devise medium-length DNA fragments using low-density parity-check codes companioned by pseudo-noise sequence (PNC-LDPC). A single cleavage on this encoded DNA by transposase generates DNA fragments of approximately full length. Using the readout-aware pseudo-noise sequences, noisy nanopore reads with arbitrary start points are directly located, and base insertions/deletions are corrected, enabling fast and reliable recovery even at very low coverages. Experimental results indicate that the data can be reliably recovered at a coverage of 1.24–3.15× with a typical nanopore sequencing error rate of 1.83%. This method enables error-free recovery in near single-molecule scenarios, highlighting the potential of PNC-LDPC encoded medium-length DNA for data storage applications.

Subject terms: DNA sequencing, DNA computing and cryptography, Information theory, Data processing


Nanopore sequencing offers rapid DNA readout but suffers from severe insertion/deletion errors. Here, authors devise medium-length DNA fragments using a PNC-LDPC coding scheme, with an efficient cleavage library preparation to quickly recover original data at very low coverages without assembly.

Introduction

DNA data storage using synthetic DNA is expected to become one of the primary options for mass cold data storage due to its high storage density, long-term stability, and low energy consumption1–4. With the rapid development of DNA synthesis5 and assembly6,7, synthetic DNA of various sizes has been constructed for promising applications in biology and materials science. Various forms of synthetic DNA have recently been evaluated to identify the most suitable DNA data storage media for modern storage systems1,2,8–11. They have become a very promising medium for data backup due to their long-term stability11–14. For accessing data from DNA, readout latency and data reliability are the major obstacles15–18. The traditional readout of DNA storage media is mainly based on biochemical reactions with high readout latency19,20. In contrast, nanopore sequencing can achieve rapid readout owing to fast electrical signal acquisition as DNA molecules pass through the nanopore channels21. However, the error rate of nanopore readout is relatively high, especially for short DNA fragments22,23. The readout errors include insertions/deletions (indels) that are difficult to deal with24,25, leading to high coverage and processing-complexity requirements for error-free recovery8,9,26–30.

Therefore, constructing large DNA fragments for readout has been suggested. On the one hand, Organick et al. assembled multiple short oligonucleotides into large DNA fragments using the polymerase chain reaction (PCR) and Gibson assembly, and successfully recovered a 32 KB file at a coverage of 36× with nanopore sequencing31. Lopez et al. assembled large DNA fragments of approximately 5 kb from an oligo pool and successfully decoded 1.67 MB of data with nanopore sequencing at a coverage of 22× (ref. 32). Though much simpler than genome assembly in biology6,7, these complex DNA manipulations are still required at the readout end. On the other hand, large DNA fragments can also be directly designed and synthesized for nanopore readout, particularly when a large number of subscribers share the writing cost. Yazdi et al. utilized constrained codes and homopolymer check codes to construct DNA fragments (~1 kb each) for storing several pictures, achieving data recovery using nanopore sequencing at a coverage of 200× (ref. 15). Chen et al. constructed a very large DNA fragment, called a yeast artificial chromosome, with a length of 254 kb to store two pictures and a short video clip (37.8 KB), using efficient low-density parity-check (LDPC) codes and superimposed watermarks; with nanopore devices, fast and portable data readout was achieved at a minimum coverage of 16.8× (ref. 8). Sun et al. constructed a DNA fragment of 51 kb to encode a text file (5.56 KB) using an encoding scheme called MEPCAL (Mixed Error Processing Coding for Arbitrary Length), which required a minimum sequencing coverage of 9.13× for recovery33. For the large-DNA storage mode, assembling noisy reads into contigs is required in the readout pipeline34–36, which is usually computationally expensive (Fig. 1a).

Fig. 1. DNA data storage using medium-length DNA.

Fig. 1

a Existing DNA storage methods typically require clustering or assembly, resulting in high computational complexity. b Workflow of the DNA data storage mode with medium-length plasmids. c Recovery tests were conducted for pLP2 (~33 kb) using highly matched reads. We independently tested the recovery performance using one to six highly matched reads. d Comparison with other DNA data storage schemes. This work achieves the lowest coverage required for data recovery. Our method was evaluated based on the average coverage metric. The required average coverage was 1.24× for pLP1 (314 trials), 1.51× for pLP2 (264 trials), 3.15× for pLP3 and pLP4 (248 trials), and 1.82× for pLP5 (206 trials).

To address these challenges, a DNA storage scheme using DNA fragments ranging from several to tens of kilobases was proposed. This scheme supports fast and reliable readout in scenarios where multiple subscribers share the writing cost and expect fast readout (Fig. 1b). The low-density parity-check codes companioned by pseudo-noise sequences (PNC-LDPC) were designed for fast error identification and correction. An efficient DNA library preparation method was adopted to generate DNA fragments approaching full length, enabling error-free recovery at low sequencing coverages. The medium-length DNA, encoding Chinese or English poems, was evaluated using 28 short circular plasmids (approximately 6–8 kb, denoted pSP1–pSP28) and 6 long circular plasmids (approximately 33–43 kb, denoted pLP1–pLP5 and pLP2-e). The experimental results verified that the PNC-LDPC coding scheme required lower sequencing coverages than other schemes. The data could be recovered at a coverage of only 3× using real nanopore sequencing reads, effectively reducing the sequencing cost and recovery time. When the DNA fragments highly match the codewords, a coverage of only 1× is required to recover the data without error; in this case, the data can be retrieved quickly and reliably from approximately a single molecule. Overall, an efficient DNA storage scheme with high reliability and fast readout was provided using medium-length DNA fragments.

Results

Medium-length DNA fragments to balance reliable and fast readout

DNA data storage with large pools of short oligonucleotides cannot achieve fast readout using sequencing-by-synthesis10, while DNA data storage with large fragments is limited by complex DNA assembly manipulation8,33. Thus, the length of synthetic DNA fragments is critical for DNA data storage. We report data storage using medium-length DNA fragments, ranging from a few to tens of kilobases, assembled in plasmids. For a low code rate, four long DNA sequences, each approximately 33–43 kb in length, were encoded and constructed to store 5922 bytes of Chinese poems. These DNA sequences were also segmented into short DNA fragments (28 pieces in total) of 6–8 kb, without any additional index information. For a high code rate, a long DNA sequence of approximately 43 kb was encoded and constructed to store Shakespeare's sonnets (3716 bytes).

The workflow contains four steps: (1) encoding, (2) DNA synthesis and assembly, (3) library preparation and sequencing, and (4) decoding and recovery (Fig. 1b and Supplementary Video 1). For a low code rate, the data were first encoded using non-binary LDPC codes37–39, companioned by pseudo-noise sequences bit by bit, and then converted into base sequences spanning tens of kilobases. The non-binary LDPC codeword with a length of 22,680 bits and a code rate of R = 1/3 was mapped to a single DNA sequence of 22,680 bases40. The other high-reliability LDPC codeword, with a length of 64,512 bits and a code rate of R = 1/2, encoded the information and check sequences into separate DNA fragments41. Similarly, for a high code rate, a binary LDPC code with a code rate of R = 0.93 was used to encode a DNA sequence of 32,000 bases42. This example was used to illustrate the flexible code rates of the proposed PNC-LDPC scheme.

Large DNA fragments can be decomposed into a series of small DNA fragments on the kilobase scale. These DNA fragments were synthesized and recursively assembled into circular plasmids as long-term data storage media. During readout, the circular plasmid was linearized using an efficient library preparation method. After sequencing, the companioned pseudo-noise sequences were extracted to quickly align the noisy reads, identify the indels, and convert them into erasures, which were then corrected using non-binary LDPC codes with strong error-correction capabilities.

The recovery assays verified that the proposed method requires very low sequencing coverages (Fig. 1c, d), and Supplementary Table 1 summarizes the results for PNC-LDPC and other representative DNA storage coding schemes. Specifically, we chose a 33-kb plasmid as a test benchmark. With highly matched reads (30–35 kb), error-free data recovery was achieved when more than two reads were used. When only a single highly matched read was used, the recovery ratio was 93.4% across all 1610 trials (Supplementary Fig. 1). Overall, an average coverage of 1.24–3.15× sufficed for error-free recovery from medium-length DNA with different code rates. The results indicated that the data could be fully recovered if the fragment length was close to the codeword length and the error rate was relatively low. If the fragment length deviated considerably from the codeword length, the error rate increased and the successful recovery ratio decreased.

PNC-LDPC scheme to enable strong error resistance and fast read alignment

The PNC-LDPC coding was proposed for medium-length DNA fragments to combat multiple types of readout errors and support fast readout at low coverages (Fig. 2a). The specific encoding steps were as follows (Supplementary Note 1 and Supplementary Figs. 2 and 3). First, the digital files were encoded using non-binary LDPC codes with large girths and strong error-correction capability40,41, constructed from high-girth Hamiltonian graphs over the Galois field GF(2^6). The detailed construction of the non-binary LDPC codes is provided in Supplementary Note 2 and Supplementary Figs. 4 and 5. Next, the LDPC codewords were companioned with pseudo-noise sequences and converted into bit pairs. For designing DNA sequences for data storage, this companion mode of block error-correction codes and pseudo-noise sequences has been validated to resist indels and facilitate the retrieval of different files43,44. Subsequently, the bit pairs were converted into bases to generate the data DNA sequences. Following this, the synthesized DNA fragments were assembled with plasmid vectors. In the case of a 945-byte text file, a ~33 kb plasmid (pLP2) and six short plasmids (pSP7–pSP12), each ranging in size from 6 to 8 kb, were generated (Fig. 2b). Figure 2c shows the structure of pLP2, which encodes seven Chinese poems. These plasmids were subsequently replicated in E. coli and extracted for use as DNA storage media (Fig. 2d). In this manner, 5922 bytes of Chinese poems were constructed into four long plasmids (33–43 kb) and 28 short plasmids (6–8 kb), as shown in Supplementary Fig. 7 and Supplementary Tables 2–4. To demonstrate the universality of the PNC-LDPC scheme, we used a binary LDPC code with a code rate of 0.93 to construct another plasmid (pLP5, 43 kb) with high logical density42 (Supplementary Fig. 10).

Fig. 2. Encoded medium-length plasmids with PNC-LDPC coding.

Fig. 2

a PNC-LDPC coding scheme. First, a digital file (945 bytes) was encoded using non-binary LDPC codes (LDPC (22680,7560), code rate R = 7560/22680 = 1/3); the non-binary LDPC (64512,32256) code with rate R = 1/2 can be found in Supplementary Fig. 3. Then, the LDPC codeword and PN sequence were directly combined into bit pairs and converted into bases. b The data DNA sequences were inserted into plasmids to form the DNA storage mode. c The ancient Chinese poems were stored in pLP2. The storage details for pLP3–pLP4 and pSP7–pSP12 are shown in Supplementary Figs. 8 and 9. d Replication and extraction of plasmids. Plasmids were replicated in E. coli and extracted for use as DNA storage media.

The PNC-LDPC coding scheme combines LDPC codewords and pseudo-noise sequences bit by bit into base sequences. Noisy reads can be straightforwardly aligned, and insertions/deletions within them can be precisely identified through the pseudo-noise sequences, without interference from superimposed data. This scheme supports the alignment of reads with arbitrary start points and arbitrary lengths along the large DNA. It thus differs from the DNA storage mode using oligo pools, where each oligo has a fixed start point and can be easily indexed with address sequences. For medium and long DNA, the breakpoints of large DNA fragments are unknown during manipulation, making it impossible to add an index to each segment; our proposed scheme addresses this issue and is compatible with medium and long DNA for data storage. The proposed coding method also differs from the superposition scheme (Supplementary Fig. 3), in which LDPC codes were first sparsified and the sparsified codewords were superimposed with pseudo-noise sequences8. In that scheme, the pseudo-noise sequences are blurred by the superimposed sparsified codewords; moreover, it requires complex read-assembly manipulation, increasing the complexity of the recovery pipeline.
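The locating property described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' pipeline: each base carries one codeword bit (upper) and one PN bit (lower), so demapping any noisy read yields a corrupted fragment of the known PN sequence that a sliding bit comparison can place. For brevity the toy handles only substitutions, whereas the actual scheme aligns indel-containing reads with minimap2; all names and parameters here are hypothetical.

```python
# Illustrative sketch (not the authors' code): locating a read with an
# arbitrary start point via the companioned pseudo-noise (PN) bits.
import random

BASE_TO_BITS = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

random.seed(7)
N = 2000
pn = [random.randint(0, 1) for _ in range(N)]     # known PN sequence
data = [random.randint(0, 1) for _ in range(N)]   # stand-in for LDPC codeword bits
dna = [BITS_TO_BASE[(d, p)] for d, p in zip(data, pn)]

# A read starting at an unknown offset, with ~2% random substitutions
start = 731
read = [random.choice("ATGC") if random.random() < 0.02 else b
        for b in dna[start:start + 600]]

# Demapping the read exposes the lower (PN) bits directly,
# with no interference from the data bits
pn_frag = [BASE_TO_BITS[b][1] for b in read]

def locate(fragment, reference):
    """Slide the demapped PN fragment along the known PN sequence and
    return the offset with the most matching bits."""
    best_off, best_score = -1, -1
    for off in range(len(reference) - len(fragment) + 1):
        score = sum(f == r for f, r in zip(fragment, reference[off:]))
        if score > best_score:
            best_off, best_score = off, score
    return best_off

print(locate(pn_frag, pn))  # recovers the true start offset, 731
```

At the correct offset roughly 99% of the PN bits match, while at any wrong offset only about half do, which is why even heavily corrupted reads can be located unambiguously.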

A single cleavage of plasmid by transposase to match the codeword

The correct plasmids containing the encoded sequences were constructed, extracted, and used as a DNA storage medium, followed by readout using nanopore sequencing. Oxford Nanopore Technologies (ONT) offers a rapid library preparation scheme that involves the cleavage of DNA by transposase and the addition of barcodes. However, this method often leads to shorter read lengths due to excessive cleavage into smaller fragments. To address this, a refined protocol was employed by adjusting the reagent dosage and reaction time for efficient library preparation (Fig. 3a). The guiding principles were to increase the input of full-length plasmids and to decrease the likelihood of multiple cleavages on a single plasmid. A comparison of different reagent dosages and reaction times is detailed in Supplementary Figs. 11–13. This protocol enabled the rapid acquisition of DNA fragments approaching full length, because a large portion of the plasmids were cleaved only once. For illustration, pLP2 was subjected to library preparation and nanopore sequencing. As shown in Fig. 3b, highly matched reads (30–35 kb in length) contributed 59.01% of the total base count. The base proportions were determined by calculating the ratio of bases in specific read-length categories (grouped in 2.5-kb intervals) to the total base count. Figure 3c demonstrates that the improved protocol resulted in a higher proportion of highly matched reads than the traditional method.

Fig. 3. Efficient library preparation to obtain long reads for low-coverage recovery.

Fig. 3

a Comparison of data recovery between the efficient and traditional methods. b Base proportions across different read lengths of pLP2. c Length distribution of sequencing reads, with 94,507 reads for the efficient library and 18,483 reads for the traditional method. d Data recovery using reads from different library preparation methods. A total of 1610 reads from the efficient library and 928 reads from the traditional library were used for recovery tests. For each retrieval, 1–7 reads were randomly sampled without replacement. The exact number of trials is presented in Supplementary Data 2. e Read length distribution of four plasmids for n sequencing reads, with 20,962 reads for pLP1, 94,507 reads for pLP2, 21,783 reads for pLP3, and 33,963 reads for pLP4. The width of violin plots in c and e indicates the kernel density estimation. The thick vertical bar within each violin represents the interquartile range, with the limits corresponding to the 25th and 75th percentiles. The thin lines (whiskers) extend to the minimum and maximum values. Source data are provided as a Source Data file.

Moreover, the data recovery performance was compared for the different library preparation methods. Using highly matched reads, the data could be recovered error-free with three or more reads (Fig. 3d). In fact, only an average coverage of 1.51× was required for pLP2 using the efficient library preparation method. The proposed library preparation improved the matching between DNA fragments and codewords, achieving efficient data recovery at low sequencing coverages. The raw reads were generated using the efficient library preparation method for four plasmids, each approximately 33–43 kb in length (Fig. 3e and Supplementary Fig. 14). For short plasmids (6–8 kb), most reads also lay in a restricted length range matching the plasmid size (Supplementary Figs. 15–19).

PN sequence alignment and error correction to achieve assembly-free reliable readout

The proposed DNA library preparation method and nanopore sequencing generated effective long reads closely matching the codeword length. The known PN sequences were employed to align these reads and identify indels within them. The corrupted LDPC codewords were then corrected by removing insertions and converting deletions into erasures (Supplementary Fig. 20). Residual errors were then corrected with iterative LDPC decoding, enabling reliable readout at low coverages (Supplementary Note 3). When sequencing coverage is not a limiting factor, very high error rates can be tolerated. The readout pipeline is illustrated in Fig. 4a.

Fig. 4. Data recovery from long noisy reads.

Fig. 4

a The data recovery workflow consisted of four steps (Supplementary Figs. 21 and 22). First, noisy reads were demapped into corrupted LDPC codewords and PN sequences. Second, corrupted PN sequences were sliding-aligned with ideal PN sequences to locate reads and identify indels. Third, indels were corrected according to the identified indel positions. Finally, majority voting across multiple corrected codewords generated a consensus sequence for LDPC decoding. b Error characteristics of raw reads in our recovery tests (Supplementary Fig. 23). The libraries were sequenced on the MinION device with flow cell R10.4.1. c Data recovery tests were conducted with varying numbers of highly matched reads. Data points represent mean ± standard error from 5 independent experiments. In each experiment, 200 highly matched reads were sampled from each plasmid for recovery tests. Recovery with 1 read was performed in 200 trials, with 2 reads in 100 trials, with 3 reads in 66 trials, and with 4 reads in 50 trials. Error-free recovery was achieved for all plasmids using 3 highly matched reads, corresponding to the following average coverages: 2.97× for pSP1–pSP6, 2.96× for pLP1, 2.96× for pSP13–pSP28, and 2.95× for pLP3–pLP4. d The distribution of valid reads required for the error-free recovery of pLP2-e. Source data are provided as a Source Data file.

First, we validated the proposed data recovery procedure through practical ‘wet’ assays. A total of 32 plasmids, comprising 28 short circular plasmids and 4 long circular plasmids, were prepared. All plasmids were processed using the proposed efficient library preparation method, while pLP1 and pLP2 were additionally processed using the traditional method for comparison. Multiple independent evaluations were performed using these real data to verify the reliability of data recovery (Supplementary Tables 5 and 6). The error characteristics of the proposed scheme were profiled, with the initial error rate of the long sequencing reads being ~2% (Fig. 4b). After insertion/deletion correction and consensus at a coverage of 3×, the erasure and error rates were within the error-correction capability of the non-binary LDPC codes (Fig. 4c). This scheme can thus effectively correct insertions, deletions, substitutions, and erasures, achieving very reliable recovery. To enhance the ability to handle long burst errors, we interleaved LDPC codewords in the PNC-LDPC scheme (Supplementary Fig. 6), thereby constructing an enhanced version of pLP2, referred to as pLP2-e. We conducted 662 independent experiments on pLP2-e (Fig. 4d). The results indicate that recovery was achieved using fewer than 3 valid reads in 96.3% of cases. Valid reads are defined as those in which the demapped corrupted PN sequence can be aligned with the ideal PN sequence. We further validated the high-rate performance of our coding scheme via in vivo experiments on pLP5. Under a raw error rate of 1%, error-free decoding was achieved with an average coverage of 1.82×, yielding a logical density of 0.93 bits/nt (Supplementary Fig. 24).

Then, we performed real-time readout verification using pLP1 to pLP4. Figure 5a shows the time required to decode three text files (a total of 5922 bytes) (Supplementary Notes 4 and 5). After library preparation, long reads were generated by the MinION sequencer equipped with flow cell R10.4.1. Following indel correction, modified codeword fragments were merged to reconstruct the consensus sequence (Supplementary Fig. 25). The decoding process was completed in 364 s, achieving error-free data recovery (Fig. 5b and Supplementary Videos 2–4). In an additional real-time readout experiment, a total of 13,750 raw reads were generated, with an error rate of 6.7%. Recovery tests further demonstrated that using the LDPC (22680,7560) code for pLP1 and pLP2 required a lower average coverage during real-time data retrieval (Supplementary Fig. 27). Overall, reading out medium-length DNA with nanopore sequencing proved more efficient than reading out short oligos45. It should be emphasized that this scheme can scale to larger data volumes, albeit with higher computational complexity; fortunately, this process can be accelerated using minimap2 with parallel threads46. We tested 37.8 KB of data (equivalent to 40× the information content of pLP1) using PNC-LDPC, and achieved error-free recovery from simulated reads with a 2.5% raw error rate (Supplementary Fig. 28). Reading out 40 plasmids totaling 37.8 KB of data with 40 threads took an average of 3.08 s (200 independent trials), compared to 1.01 s for pLP1 alone (945 bytes).

Fig. 5. Experimental validation of real-time readout using nanopore sequencing.

Fig. 5

a The time consumption of each step (Supplementary Note 4). All three files were recovered error-free within 364 s after sequencing started. Using locally stored sequencing reads for offline recovery tests, all three files achieved error-free recovery within 12 s. b The sequential arrival distribution of valid reads for error-free recovery from sequencing start. The x-axis represents the different plasmids, while the y-axis represents the number of valid reads for recovery. A total of 222 sequencing reads were processed, among which 12 valid reads longer than 2 kb were used for decoding the three files (Supplementary Fig. 26), accounting for 46.3% of the total base count. The 210 discarded reads, accounting for 53.7%, were either genome interference reads, those with very high error rates and short lengths, or reads generated after data recovery (shown in gray). Error-free recovery was achieved for all plasmids using these 12 valid reads, corresponding to the average coverages: 1.68× for pLP1, 1.35× for pLP2, 0.99× for pLP3, and 2.59× for pLP4 (Supplementary Tables 7 and 8). Source data are provided as a Source Data file.

Finally, we employed a simulation evaluation to predict the error-correction upper bound of PNC-LDPC coding (Supplementary Table 11). The correction capability of the proposed PNC-LDPC scheme is determined by the consensus process and the LDPC codes. In our design, once the code rate and structure of the LDPC codes were chosen, their capability to correct substitutions and erasures was fixed. The sequencing coverage determined the output error rate after the indel-correction step: the larger the coverage, the lower the error rate fed to the LDPC decoder (Supplementary Figs. 29–31). For example, error-free recovery was achieved at a raw error rate of up to 43% with an average coverage of 496× (R = 1/3), at a raw error rate of 39% with an average coverage of 730× (R = 1/2), and at a raw error rate of 28% with an average coverage of 242× (R = 0.93). This means that our PNC-LDPC scheme can effectively use consensus to tolerate very high error rates, though we have emphasized low-coverage recovery using highly matched long reads. The simulations demonstrated the scalability and robustness of the PNC-LDPC framework across different code rates and sequencing conditions.
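The coverage dependence can be illustrated with a back-of-the-envelope calculation, not the authors' simulation: under a simplified model where residual errors after indel correction behave like independent substitutions with probability p, the per-position error rate after majority voting over n aligned copies is a binomial tail that shrinks rapidly with n.

```python
# Simplified model (not the authors' simulation): per-position error
# after majority voting over n copies, with i.i.d. substitution rate p.
from math import comb

def majority_error(p, n):
    """P(majority of n copies is wrong) for odd n: at least n//2 + 1 errors."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With a 2% residual error rate, the voted error rate drops fast with coverage
for n in (1, 3, 5, 9):
    print(n, majority_error(0.02, n))
```

Already at n = 3 the voted error rate falls to about 1.2 × 10⁻³, which is consistent with the observation above that a 3× consensus brings the residual error and erasure rates within the LDPC correction capability.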

Discussion

We developed a fast, low-coverage, and reliable data storage scheme that utilizes medium-length DNA fragments as storage media, aiming to enhance existing DNA data storage paradigms. Our approach is characterized by the use of medium-length circular plasmids, ranging from a few kilobases to tens of kilobases, as the storage medium. To ensure high reliability, we introduced the PNC-LDPC coding scheme. As a constituent of the DNA fragments, the PN sequences can locate each noisy read and directly identify the indel positions, thus avoiding complex read assembly for large DNA fragments. We also proposed an efficient library preparation method that produces high-quality linearized fragments matching the codewords. This method enabled error-free recovery from nanopore reads, even at just 3× sequencing coverage.

The cost of writing data in medium-length DNA media should be considered for practical applications. A reduction in the synthesis cost of large DNA fragments would further enhance the value of this method9,46,47. Enzymatic DNA synthesis enables rapid and controlled single-base additions to a DNA molecule, allowing the direct synthesis of DNA molecules longer than 1000 bases without the need for assembly48. We expect that the storage cost of medium-length DNA will be significantly reduced in the near future with the rapid development of DNA synthesis. Moreover, given the “write once, read many” (WORM) mode of DNA storage, the shared retrieval cost can be used as a key metric of DNA data storage (Supplementary Table 12). Across numerous independent retrievals, the synthesis cost is shared and the readout cost becomes dominant. For other molecular data storage media, e.g., synthetic polymers, the cost of massive independent retrieval cannot be significantly reduced, because such polymers can be neither amplified (unlike DNA via PCR) nor conveniently read out with high-throughput methods (unlike DNA via NGS). Therefore, in the WORM scenario, the low-coverage requirement of our scheme confers an additional cost advantage.

Methods

Workflow of DNA data storage using medium-length plasmids

The data storage workflow using medium-length circular plasmids was divided into four steps (Fig. 1b). First, the PNC-LDPC coding was employed to map digital files (Chinese poems) into data DNA sequences (Supplementary Data 1). Second, each data DNA sequence was decomposed and assembled to form 28 plasmids (6–8 kb) and four long plasmids (33–43 kb). Third, an efficient library preparation and nanopore sequencing were performed for the data-carrying plasmids. Finally, raw reads were fully utilized, with errors excluded by PN sequence alignment and the LDPC decoding algorithm, allowing rapid data recovery at low coverages.

PNC-LDPC coding scheme

A PNC-LDPC coding method was proposed to achieve fast, reliable recovery at low coverages (Supplementary Note 1). LDPC codes add redundancy to protect the information sequence against substitution and erasure errors, while the PN sequences were employed to locate noisy reads and quickly identify indels. The PNC-LDPC coding method was used to encode 43 Chinese poems into DNA sequences (Supplementary Tables 9 and 10). Taking the non-binary LDPC (22680,7560) code as an example, the PNC-LDPC encoding procedure is as follows.

  1. The information sequence of 7560 bits was encoded using non-binary LDPC codes to produce a full codeword of 22,680 bits.

  2. The codeword was combined with a pseudo-noise sequence of the same length (22,680 bits) bit by bit, forming the bit pairs.

  3. The mapping rule {(00)→A, (01)→T, (10)→G, (11)→C} was utilized to convert the bit pairs into bases, thus obtaining a data DNA sequence with a length of 22,680 bases (Fig. 2a).

  4. The data DNA sequences were synthesized and assembled to form medium-length plasmids in two modes: mode I, six short plasmids of 6–8 kb, and mode II, one long plasmid of approximately 33 kb (Fig. 2b).
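Steps 2 and 3 above amount to a simple position-wise bit-pair-to-base mapping. A minimal sketch, assuming the upper bit of each pair is the LDPC codeword bit and the lower bit is the PN bit (consistent with the demapping rule in the recovery method); the toy inputs are illustrative, not real codeword data:

```python
# Sketch of steps 2-3: combine LDPC codeword and PN bits into bit pairs,
# then map each pair to a base. Not the authors' code.
PAIR_TO_BASE = {(0, 0): "A", (0, 1): "T", (1, 0): "G", (1, 1): "C"}

def encode(codeword_bits, pn_bits):
    """Combine codeword and PN bits position-wise and map pairs to bases."""
    assert len(codeword_bits) == len(pn_bits)
    return "".join(PAIR_TO_BASE[(c, p)] for c, p in zip(codeword_bits, pn_bits))

# Toy example with 8 bits (a real codeword has 22,680 bits)
cw = [1, 0, 1, 1, 0, 0, 1, 0]
pn = [0, 1, 1, 0, 0, 1, 0, 0]
print(encode(cw, pn))  # GTCGATGA
```

Because the mapping is one base per bit pair, the DNA sequence has exactly the codeword length in bases, with no rate loss from the PN companion.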

Medium-length DNA construction and extraction

The encoded DNA sequences (referred to as DNA chunks) were divided into multiple 6–8 kb DNA mini-chunks, which were further subdivided into several ~1.5 kb building blocks using Spirillum 6.0. Polymerase cycling assembly and overlap extension PCR were used to synthesize each building block. Subsequently, the DNA mini-chunks were assembled onto a pUC57 vector through homologous recombination overnight using Trelief® SoSoo Cloning Kit Ver.2 (Tsingke TSV-S2). Each correct mini-chunk was then screened using colony PCR and Sanger sequencing.

Multiple mini-chunks and the pCC1413 vector were PCR amplified using I-5™ 2×High-Fidelity Master Mix (Tsingke TP001) to assemble the final full-length DNA chunk, and these fragments were obtained through gel extraction. Then, 100 fmol mini-chunks and 50 fmol linear vectors were assembled into a plasmid in one step via the yeast assembly. Colonies with correct plasmids were screened using colony PCR of each adjoining region. The plasmid was extracted using alkaline lysis and alcohol precipitation, then electrotransformed into E. coli EPI300. Finally, next-generation sequencing was employed to evaluate the correctness of the chunk sequence.

PureLink™ HiPure Plasmid Midiprep Kit (Thermo Fisher K210005) was adopted to extract the correct DNA plasmids. Notably, a 0.01% final concentration of L-arabinose was added during the culture of the four E. coli strains containing the pCC1413 vector to increase the copy number of plasmids (Fig. 2d).

Efficient library preparation and sequencing

The plasmids synthesized by Beijing Tsingke Biotechnology Co., Ltd. were carefully dissolved in double-distilled water to achieve a concentration of 1 µg/µL for each plasmid. Each plasmid DNA sample (1 µL) was quantified using Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific, Cat# Q32850). Sequencing libraries were prepared using the Rapid Barcoding Kit 24 V14 (SQK-RBK114.24, Oxford Nanopore Technologies). Samples were diluted in sterilized ddH₂O to a working concentration of 330 ng/µL, and 9 µL of each diluted DNA was transferred to a 0.2 mL PCR tube for barcoding. For each sample, 9 µL of normalized DNA was mixed with 1 µL of the respective Rapid Barcode (RB01–24). The mixture was gently pipetted to ensure homogeneity and briefly centrifuged. The reaction was incubated in a water bath at 30 °C for 30 s, followed by 80 °C for 1 min, and then placed on ice for cooling. After centrifugation, all barcoded samples were pooled into a single 1.5 mL LoBind tube. The Rapid Adapter (RA) was prepared by combining 0.9 µL of RA with 2.1 µL of Adapter Buffer (ADB) and mixed thoroughly. A total of 1 µL of the diluted adapter was added to the pooled barcoded DNA, followed by gentle flicking and centrifugation. The final ligation reaction was incubated at 25 °C for 5 min before proceeding to sequencing on the R10.4.1 flow cell.

Fast data recovery at low coverages

A reliable low-coverage data recovery method was proposed, comprising pseudo-noise (PN) sequence alignment and LDPC decoding (Fig. 4a and Supplementary Note 3).

  1. The mapping rule {A → (00), T → (01), G → (10), C → (11)} was employed to demap each read into two bit streams: the upper bits forming the corrupted LDPC codeword and the lower bits forming the corrupted PN sequence.

  2. The corrupted PN sequences were aligned against the ideal PN sequences using minimap2, determining the position of each read and identifying the specific indel positions within the noisy reads49.

  3. The corrupted LDPC codewords were corrected according to indel positions. Bits resulting from insertions were removed, and those corresponding to deletions were marked as erasures.

  4. A majority voting policy was applied at each position to generate a consensus sequence. A position was marked as an erasure if sequencing data were unavailable there or if the majority voting rule could not be satisfied.

  5. LDPC decoding was employed to correct the residual errors, and the original data were successfully recovered.
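The per-read processing in steps 1, 3, and 4 can be sketched as follows. This is an illustrative sketch, not the published implementation: the function names and erasure marker are hypothetical, and the indel positions are assumed to be supplied by the minimap2 PN alignment of step 2.

```python
# Illustrative sketch of recovery steps 1, 3, and 4 (assumed helper names).
BASE_TO_BITS = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
ERASURE = None  # hypothetical marker for erased/unknown bits


def demap(read):
    """Step 1: split a read into upper (LDPC) and lower (PN) bit streams."""
    upper = [BASE_TO_BITS[b][0] for b in read]
    lower = [BASE_TO_BITS[b][1] for b in read]
    return upper, lower


def correct_indels(upper_bits, insertions, deletions):
    """Step 3: drop bits at insertion positions; mark deletions as erasures.

    `insertions` and `deletions` are positions reported by the PN alignment.
    """
    ins = set(insertions)
    kept = [b for i, b in enumerate(upper_bits) if i not in ins]
    for pos in sorted(deletions):
        kept.insert(pos, ERASURE)  # a deleted base contributes no information
    return kept


def majority_vote(columns):
    """Step 4: per-position consensus; a tie or no data yields an erasure."""
    consensus = []
    for col in columns:
        votes = [b for b in col if b is not ERASURE]
        if not votes:
            consensus.append(ERASURE)
            continue
        zeros, ones = votes.count(0), votes.count(1)
        consensus.append(0 if zeros > ones else 1 if ones > zeros else ERASURE)
    return consensus
```

For example, `demap("ATGC")` yields upper bits `[0, 0, 1, 1]` and lower bits `[0, 1, 0, 1]`; the consensus with its erasure positions would then be passed to the LDPC decoder of step 5, which treats erasures as unknown bits.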

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

41467_2025_65004_MOESM2_ESM.docx (14.4KB, docx)

Description of Additional Supplementary Files

Supplementary Data 1 (134.6KB, xlsx)
Supplementary Data 2 (16.8KB, xlsx)
Supplementary Video 1 (29MB, mp4)
Supplementary Video 2 (19.3MB, mp4)
Supplementary Video 3 (24.3MB, mp4)
Supplementary Video 4 (21MB, mp4)
Reporting Summary (84.4KB, pdf)

Source data

Source Data (7.2MB, xlsx)

Acknowledgements

This study was sponsored by grants from the National Key Research and Development Program of China (2023YFA0913800 and 2021YFF1200200 to W.C.; 2024YFF1500500 to Y.Y.). The authors thank Dr. Zaoxia Wang (Beijing Tsingke Biotechnology Co., Ltd) for her help in plasmid synthesis, and Yuxin Zhang for providing part of the LDPC coding program.

Author contributions

W.C. and Y.Y. proposed the study and checked the results. W.C. wrote the encoding program and part of the recovery program. R.Q. completed the library preparation and nanopore sequencing. J.G., Q.Guo, and Q.Ge wrote part of the recovery program. Q.Guo analyzed the sequencing data. J.G., Q.Guo, R.Q. and W.C. wrote the manuscript. W.C. and Y.Y. modified the manuscript. All authors reviewed and approved the manuscript.

Peer review

Peer review information

Nature Communications thanks Benjamin Cressiot and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

The original data files, encoded codewords, encoded sequences, plasmid sequences, and pseudo-noise sequences are available via Zenodo at 10.5281/zenodo.16883332 (ref. 50). The sequencing reads (FASTQ format) have been deposited in the Sequence Read Archive under accession number PRJNA1235219, and the raw electrical signal data (POD5 format) are available via Zenodo at 10.5281/zenodo.16883332 (ref. 50). Source data are provided with this paper.

Code availability

The source code for rapid data readout in near-single-molecule scenarios is publicly available on GitHub at https://github.com/quanguo2088/Approaching-single-molecule-data-readout-for-DNA-Storage under the MIT license. The specific version of the code associated with this publication is archived in Zenodo at 10.5281/zenodo.16883573 (ref. 51). This implementation makes use of several third-party software packages under their respective licenses, including LDPC codes by Radford M. Neal (https://github.com/radfordneal/LDPC-codes), LDPC codes by MacKay, D. J. C. (http://www.inference.org.uk/mackay/codes/data.html#l142), and minimap2 by Li, H. (https://github.com/lh3/minimap2).

Competing interests

W.C., J.G., and Y.Y. hold a Chinese patent related to a DNA storage scheme based on pseudo-noise sequence companioned encoding (application number 2023106262346). The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Weigang Chen, Rui Qin, Quan Guo, Jian Guo.

Contributor Information

Weigang Chen, Email: chenwg@tju.edu.cn.

Yingjin Yuan, Email: yjyuan@tju.edu.cn.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-65004-7.

References

  1. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
  2. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
  3. Doricchi, A. et al. Emerging approaches to DNA data storage: challenges and prospects. ACS Nano 16, 17552–17571 (2022).
  4. Bar-Lev, D., Sabary, O. & Yaakobi, E. The zettabyte era is in our DNA. Nat. Comput. Sci. 4, 813–817 (2024).
  5. Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).
  6. Wu, Y. et al. Bug mapping and fitness testing of chemically synthesized chromosome X. Science 355, eaaf4706 (2017).
  7. Xie, Z. et al. “Perfect” designer chromosome V and behavior of a ring derivative. Science 355, eaaf4704 (2017).
  8. Chen, W. et al. An artificial chromosome for data storage. Natl. Sci. Rev. 8, nwab028 (2021).
  9. Zhang, Q. et al. Catalytic DNA-assisted mass production of arbitrary single-stranded DNA. Angew. Chem. Int. Ed. Engl. 135, e202212011 (2023).
  10. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
  11. Liu, F., Li, J., Zhang, T., Chen, J. & Ho, C. L. Engineered spore-forming bacillus as a microbial vessel for long-term DNA data storage. ACS Synth. Biol. 11, 3583–3591 (2022).
  12. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. Engl. 54, 2552–2555 (2015).
  13. Koch, J. et al. A DNA-of-things storage architecture to create materials with embedded memory. Nat. Biotechnol. 38, 39–43 (2020).
  14. Song, L. et al. Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat. Commun. 13, 5361 (2022).
  15. Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
  16. Hou, Z. et al. “Cell Disk” DNA storage system capable of random reading and rewriting. Adv. Sci. 11, 2305921 (2024).
  17. Xu, Y., Ding, L., Wu, S. & Ruan, J. Overcoming the high error rate of composite DNA letters-based digital storage through soft-decision decoding. Adv. Sci. 11, 2402951 (2024).
  18. Gopalan, P. S. et al. Trace reconstruction from noisy polynucleotide sequencer reads. US Patent 15/536,115 (2018).
  19. Fuller, C. W. et al. The challenges of sequencing by synthesis. Nat. Biotechnol. 27, 1013–1023 (2009).
  20. Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017).
  21. Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat. Biotechnol. 34, 518–524 (2016).
  22. Wilson, B. D., Eisenstein, M. & Soh, H. T. High-fidelity nanopore sequencing of ultra-short DNA targets. Anal. Chem. 91, 6783–6789 (2019).
  23. Zee, A. et al. Sequencing Illumina libraries at high accuracy on the ONT MinION using R2C2. Genome Res. 32, 2092–2106 (2022).
  24. Banerjee, A., Yehezkeally, Y., Wachter-Zeh, A. & Yaakobi, E. Error-correcting codes for nanopore sequencing. IEEE Trans. Inf. Theory 70, 4956–4967 (2024).
  25. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
  26. Davey, M. C. & MacKay, D. J. C. Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans. Inf. Theory 47, 687–698 (2001).
  27. Ping, Z. Towards practical and robust DNA-based data archiving using the Yin–Yang codec system. Nat. Comput. Sci. 2, 11 (2022).
  28. Yuan, L., Xie, Z., Wang, Y. & Wang, X. DeSP: a systematic DNA storage error simulation pipeline. BMC Bioinformatics 23, 185 (2022).
  29. Press, W. H., Hawkins, J. A., Jones, S. K., Schaub, J. M. & Finkelstein, I. J. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proc. Natl. Acad. Sci. USA 117, 18489–18496 (2020).
  30. Banerjee, A., Wachter-Zeh, A. & Yaakobi, E. Insertion and deletion correction in polymer-based data storage. IEEE Trans. Inf. Theory 69, 4384–4406 (2023).
  31. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
  32. Lopez, R. et al. DNA assembly for nanopore data storage readout. Nat. Commun. 10, 2933 (2019).
  33. Sun, F. et al. Mobile and self-sustained data storage in an extremophile genomic DNA. Adv. Sci. 10, e2206201 (2023).
  34. Wee, Y. et al. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief. Funct. Genomics 18, 1–12 (2019).
  35. Senol Cali, D., Kim, J. S., Ghose, S., Alkan, C. & Mutlu, O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief. Bioinformatics 20, 1542–1559 (2019).
  36. Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021).
  37. Davey, M. C. & MacKay, D. J. C. Low-density parity check codes over GF(q). IEEE Commun. Lett. 2, 165–167 (1998).
  38. Exoo, G. A trivalent graph of girth 17. Australas. J. Comb. 24, 261–264 (2001).
  39. Poulliat, C., Fossorier, M. & Declercq, D. Design of regular (2, dc)-LDPC codes over GF(q) using their binary images. IEEE Trans. Commun. 56, 1626–1635 (2008).
  40. Chen, W., Liang, C., Guo, T. & Ding, Y. Encoder implementation with FPGA for non-binary LDPC codes. In Proc. 2012 18th Asia-Pacific Conference on Communications (APCC) 980–984 (IEEE, 2012).
  41. Chen, W. et al. Non-binary LDPC codes defined over the general linear group: finite length design and practical implementation issues. In Proc. VTC Spring 2009 - IEEE 69th Vehicular Technology Conference (VTC) 1–5 (IEEE, 2009).
  42. MacKay, D. J. C. Encyclopedia of Sparse Graph Codes http://www.inference.org.uk/mackay/codes/data.html#l142 (2015).
  43. Liu, Y. & Chen, W. Hard-decision iterative decoder for the Davey–MacKay construction with symbol-level inner decoder. Electron. Lett. 52, 1026–1028 (2016).
  44. Chen, W., Wang, L., Han, M., Han, C. & Li, B. Sequencing barcode construction and identification methods based on block error-correction codes. Sci. China Life Sci. 63, 1580–1592 (2020).
  45. Zhao, X. et al. Composite hedges nanopores codec system for rapid and portable DNA data readout with high INDEL-correction. Nat. Commun. 15, 9395 (2024).
  46. Eisenstein, M. Enzymatic DNA synthesis enters new phase. Nat. Biotechnol. 38, 1113–1116 (2020).
  47. Lee, H. H., Kalhor, R., Goela, N., Bolot, J. & Church, G. M. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nat. Commun. 10, 2383 (2019).
  48. Palluk, S. et al. De novo DNA synthesis using polymerase-nucleotide conjugates. Nat. Biotechnol. 36, 645–650 (2018).
  49. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  50. Chen, W. Datasets of real-time data readout for DNA storage. Zenodo 10.5281/zenodo.16883332 (2025).
  51. Chen, W. Software of real-time data readout for DNA storage. Zenodo 10.5281/zenodo.16883573 (2025).


