Science Advances. 2026 Apr 17;12(16):eaec1469. doi: 10.1126/sciadv.aec1469

From spacecraft ranging to massive DNA data storage: Composite ranging codes as indices and error correction references

Yuxin Zhang 1, Rui Qin 1, Qi Ge 1, Quan Guo 1, Weigang Chen 1,2,3,*
PMCID: PMC13089337  PMID: 41996509

Abstract

DNA data storage has emerged as a promising data-archiving medium in which data are distributed across many unordered DNA strands. Retrieving large-scale data from these strands is therefore a critical challenge, because the strands require indices and are prone to errors. Here, we propose an accompanying indexing and progressive recovery framework with specialized long composite ranging codes (LCRCs) for massive DNA data storage. Specifically, short fractions of the LCRC serve as accompanying indices for megabyte- to petabyte-scale data. Correlation with the short component codes enables rapid data recovery, while alignment to the LCRC facilitates reliable recovery under severe insertions/deletions. Simulations reveal that this scheme can be extended to the petabyte scale. Progressive error correction enables low-coverage recovery. Real-time read-by-read decoding recovered 12.87-megabyte files in ~20 min with 3.66× coverage at an error rate of ~4.9% using a nanopore sequencer. This framework provides a universal and practical strategy for DNA data storage.


Composite ranging code is partitioned to index numerous DNA strands, enabling progressive recovery via correlation and alignment.

INTRODUCTION

Synthetic DNA is emerging as a promising medium for long-term and stable data archives due to its high density, durability, and low maintenance costs (1–3). This potential has been demonstrated via a series of proof-of-concept experiments (4–8). DNA data storage encodes and distributes information across numerous DNA strands (4–6). Massive synthetic DNA strands are pooled together in an unordered manner (5). For data storage with oligonucleotide (oligo) pools, digital files are stored in DNA pools comprising many short strands of 100 to 300 nucleotides (nt). For data storage with large DNA fragments, data are encoded into DNA fragments ranging from several to thousands of kilobases (kb) (9), which can then be read out via shotgun sequencing (10, 11). For these scenarios, a torn-paper model is formulated (12, 13). To facilitate data reconstruction, most prior studies appended a short index to each unique strand, referred to as the index-payload (I + P) paradigm (Fig. 1A) (1, 2, 5). However, for massive data storage, constructing a large and reliable codebook of indices and payloads under this paradigm is quite difficult (14, 15).

Fig. 1. LCRC accompanies massive DNA strands and supports error correction.


(A) Traditional Index+Payload DNA data storage. (B) Proposed LCRC piloting massive DNA data storage. (C) Pros of the proposed scheme. (D) Data volume increases with the lengths of SCCs, where n is the SCC number. (E) Logical density (excluding primers) comparison with the state-of-the-art schemes. For in vitro experiments, HEDGES and DNAformer were verified at raw error rates of 3.59 and 4.59%, respectively; our scheme was stress-tested at 8.32 to 9.30%. For in silico simulations, Pi: Pd: Ps = 1:1:1.

On the one hand, indices are usually vulnerable to errors and DNA degradation. First, DNA strands are inherently prone to synthesis and sequencing errors, including insertions, deletions, and substitutions (16–18). Short indices are difficult to protect (19). Second, DNA undergoes degradation (20–22), e.g., hydrolysis (23), which can lead to strand breaks. This may cause the entire index to be missing. In addition, in data storage with large DNA, each shotgun sequencing fragment originates from a random location and varies in length (24). These short fragments lack indices. Therefore, many efforts have been dedicated to designing and protecting short indices. In DNA fountain codes, a short seed is encoded together with the data payload (5). HEDGES (hash encoded, decoded by greedy exhaustive search) applies an extra technique (called salt protection, derived from cryptography) to the index (25). Other studies used error correction codes (ECCs) to protect the indices (26). If the index is missing, then a read is challenging to process. On the other hand, traditional payload reconstruction is computationally intensive due to insertions and deletions (indels). It generally relies on clustering, consensus, and error correction (6, 14, 27–29). For instance, with popular heuristic algorithms, the clustering exhibits superlinear time complexity (30). There are also clustering-free schemes. Among them, HEDGES pioneered effective read-by-read correction of insertions, deletions, and substitutions (25). A de novo strand assembly algorithm that uses de Bruijn graphs enables robust data reconstruction from degraded DNA samples (31). For shotgun sequencing of large DNA, data retrieval commonly resorts to overlap-based de novo assembly (10, 11). These assembly-based methods require high sequencing coverage and computational effort.

Here, to address these challenges for massive DNA data storage, we propose an accompanying indexing and progressive recovery framework with specialized long composite ranging codes (LCRCs). This framework stems from spacecraft ranging with composite ranging codes (32–37). Arbitrarily short fractions of the code can find their starting points via correlation with the short component codes (SCCs). Specifically, we redesign the LCRC with SCCs of similar lengths and use short fractions of the LCRC as indices to pilot massive DNA strands and facilitate error correction (Fig. 1, B and C). For readout, we establish a progressive error correction pipeline. Correlation with the SCCs is performed to filter low-error-rate reads. Notably, it requires only a relatively small number of correlation computations, enabling rapid recovery. An alignment-based method using the LCRC as a hidden reference is devised to identify and polish reads with severe indels. This framework adapts to varying DNA strand lengths, allowing it to be applied to both oligo pools and large DNA fragments. We performed gigabyte (GB)–scale data recovery simulations. These simulations indicated that our framework scales well to support data volumes of several to hundreds of petabytes (PB; 10^15 bytes) (Fig. 1D). Simulations and experimental validations demonstrated that progressive error correction enables error-free recovery from low sequencing coverage across various error conditions (tables S1 to S3). Even in the presence of DNA degradation, the stored data can still be fully restored. Compared with state-of-the-art schemes such as HEDGES and DNAformer (29), our framework tolerates higher indel rates while maintaining sequencing coverage and logical density comparable to HEDGES (Fig. 1E and tables S4 and S5). This robustness stems from accurate index identification, read-by-read indel correction, and effective consensus. DNAformer, despite achieving a high logical density, relies on transformer-based networks for consensus and cannot work in a read-wise manner, requiring a high sequencing coverage of 16× (table S6). For detailed comparisons, see text S1. The proposed method also features low computational complexity and scales readily to large data volumes. In vitro experiments on a 12.87-megabyte (MB) dataset verified the fast recovery capability. Moreover, this framework supports read-by-read decoding in real time, matching fast nanopore sequencing. In real-time nanopore sequencing tests, we rapidly retrieved image datasets of 12.87 MB, 4.72 MB, and 191.1 KB within ~20, ~6, and ~7 min, respectively, at low sequencing coverages of 3.66× to 5.69× with raw error rates of 4.9 to 8.6% (movies S1 to S8).

RESULTS

LCRCs indexing massive DNA storage

In DNA data storage, indices are crucial given the distributed and unordered nature of the strands. However, under the traditional index-payload paradigm, indices face the challenge of complete absence or decoding failure due to high error rates, posing a substantial risk of data loss. We developed an accompanying indexing scheme using elaborately designed LCRCs for large-scale DNA storage (Fig. 1B and fig. S1). This scheme was inspired by spacecraft ranging with composite ranging codes, where short fractions of the code can be identified via correlation with SCCs, regardless of the starting point (fig. S2). The redesigned LCRC is made up of a limited number of SCCs, maintaining efficient identification of arbitrary starting sites and lengths. Figure 2 illustrates a toy example of constructing 11 pieces of 20-nt DNA oligos (details in table S7). User data of 204 bits were encoded with a block ECC, adding 16 redundant bits. The LCRC was constructed with SCCs of similar lengths. In this example, an LCRC of 231 bits was constructed with three SCCs and partitioned into segments of 20 bits as indices. These short LCRC segments were combined with encoded bit vectors of the same length to form bit pairs. The bit pairs were subsequently converted into 20-nt DNA payload sequences by mapping {00, 01, 10, 11} to {A, T, G, C}. In practice, paired-end primers were appended to the DNA payload sequences to form full oligo sequences. This scheme is well suited for both oligo pools and large DNA fragments due to its scalability in strand length. In the following, we performed a series of experiments to verify the advantages of this scheme (figs. S3 to S7). The encoding processes using LCRCs for these experiments are collected in figs. S8 to S14.
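The final transcoding step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; in particular, the pairing convention (codeword bit first, LCRC bit second) is our assumption:

```python
# Map each 2-bit pair to one nucleotide, following the {00, 01, 10, 11} ->
# {A, T, G, C} mapping described in the text.
BIT_PAIR_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}

def transcode(codeword_seg: str, lcrc_seg: str) -> str:
    """Combine a codeword segment with an equal-length LCRC segment into
    bit pairs and transcode the pairs into a DNA payload sequence.
    The pairing order (codeword bit first) is an assumption."""
    assert len(codeword_seg) == len(lcrc_seg)
    return "".join(BIT_PAIR_TO_BASE[c + l]
                   for c, l in zip(codeword_seg, lcrc_seg))

# Two 20-bit segments yield one 20-nt payload, as in the toy example.
payload = transcode("11001010110010101100", "10101100101011001010")
```

Decoding simply inverts the base-to-pair mapping, after which the LCRC bits can be stripped out and correlated against the SCCs.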

Fig. 2. A toy example of the accompanying indexing scheme with LCRCs.


(A) Three SCCs with lengths of 3, 7, and 11 are chosen to construct an LCRC with a full length of 3 × 7 × 11 = 231. The LCRC is partitioned into 11 nonoverlapping segments of 20 bits as accompanying indices. (B) User data of 204 bits are encoded into a 220-bit block codeword. The codeword is partitioned into 11 segments, each 20 bits in length. (C) Each codeword segment is combined with an LCRC segment to form bit pairs, and the pairs are then transcoded into a 20-nt DNA oligo sequence. In total, 11 pieces of oligos are generated. SCCs, short component codes; LCRC, long composite ranging code. Source data are provided in table S7.

First, the accompanying indexing scheme can be flexibly extended to large-scale data storage using LCRCs. It enables DNA pools to accommodate user data on a scale from MB to PB (Fig. 1D). The LCRC length expands sharply as the average SCC length increases, indexing up to 10^15 short DNA strands (table S8). An LCRC is built from a logical combination of short-period pseudo-noise (PN) sequences (i.e., SCCs) (fig. S15). The lengths of the SCCs are pairwise relatively prime, so the full length of the LCRC is P = p1 × p2 × ⋯ × pn, where n is the number of SCCs and pi denotes the length of the i-th SCC. The number of correlations required to identify an individual read is only p1 + p2 + ⋯ + pn. Accordingly, the first design criterion is to use several SCCs of similar length to maintain a minimal correlation computational budget while maximizing the LCRC length. The combination logic of the LCRC is redesigned using majority logic. The second criterion is to select a small odd number of SCCs (n = 3 or 5) to achieve a reasonable trade-off between indexing scale and correlation performance. We observed that using fewer SCCs or increasing the payload length leads to improved correlation performance (figs. S16 and S17). Moreover, SCCs with favorable balance properties are preferred (fig. S18). For details of the LCRC construction, see text S2.
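A toy version of this construction can be sketched as follows: SCCs of pairwise-coprime lengths are combined bit by bit with majority logic over an odd number of components. The SCC bit patterns here are arbitrary placeholders, and the authors' redesigned combination logic (text S2) may differ in detail:

```python
from math import prod

def build_lcrc(sccs):
    """Combine SCCs (0/1 lists with pairwise-coprime lengths) into an LCRC
    of period P = p1 * p2 * ... * pn via bit-wise majority logic over an
    odd number of components (toy sketch of the construction in text S2)."""
    P = prod(len(s) for s in sccs)  # full LCRC period
    return [1 if 2 * sum(s[k % len(s)] for s in sccs) > len(sccs) else 0
            for k in range(P)]

# Three toy SCCs of coprime lengths 3, 7, and 11 give a 231-bit LCRC,
# matching the example of Fig. 2.
code = build_lcrc([[1, 0, 1],
                   [1, 1, 1, 0, 0, 1, 0],
                   [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]])
```

Majority logic guarantees that each output bit agrees with each component code more often than not, which is what makes the per-SCC correlation search described below possible.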

Then, we stress-tested the error-tolerance upper bound of the correlation with SCCs (figs. S19 to S21). Simulations indicated that while the usage ratio (the percentage of reads identified via correlation) decreases as the total error rate rises, the correct identification ratio remains near 100%. To maintain a usage ratio of at least 50%, the error upper bound is about one to three errors within a 160-nt read (mainly insertions/deletions), depending on where the insertions/deletions occur. In contrast, for the substitution-only scenario, the upper bound extends to 23 substitutions within a 160-nt read.

Last, we verified that the LCRC-based accompanying indexing avoids long homopolymers and extreme guanine-cytosine (GC) content. This results from the randomization of the user data (6, 27, 38) and the inherent PN properties of LCRCs (text S3). For the designed sequences, more than 99% of homopolymer runs are shorter than 4 nt (figs. S22A and S23A), consistent with the theoretical prediction. The probability of long homopolymers decays exponentially with increasing run length (fig. S22B). Simulation and theoretical analysis consistently indicate that the GC content of the designed sequences is distributed between 40 and 60% with high probability (figs. S22, C to E, and S23B). Furthermore, the designed 200-nt single-stranded DNA structures exhibit high stability. Secondary structure analysis reveals that the sequences have minimum free energy values lower than −8 kcal/mol, indicating a stable structure (fig. S23C) (39). Therefore, we did not use additional constrained coding, thereby avoiding further loss in coding efficiency (38).
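These sequence constraints are straightforward to audit. A minimal checker for homopolymer run length and GC content (our own helper, not part of the authors' pipeline) might look like:

```python
from itertools import groupby

def max_homopolymer(seq: str) -> int:
    """Length of the longest single-nucleotide run in a DNA sequence."""
    return max(len(list(run)) for _, run in groupby(seq))

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# A designed strand would be expected to satisfy, e.g.,
# max_homopolymer(seq) < 4 and 0.4 <= gc_content(seq) <= 0.6.
```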

Progressive error correction with accompanying LCRCs

Leveraging the accompanying property, we previously used a single accompanying PN sequence to enhance the indel resistance of short DNA barcodes (40). However, a single PN sequence cannot meet the indexing requirements of DNA data storage. Therefore, we adapted the LCRCs used for spacecraft ranging to facilitate read identification and error correction during readout. Figure 3A shows our progressive error correction pipeline, which restores the original data by concatenating correlation, alignment, and consensus with ECCs. SCC correlation was applied to filter reads with few errors, while LCRC alignment was adopted to correct severe indels (figs. S24 to S27). On this basis, we restored user data with consensus and ECCs. This progressive error correction pipeline can adapt to different error rates.

Fig. 3. Progressive data readout for LCRC-based DNA data storage.


(A) Overall data readout pipeline contains correlation, alignment, and consensus and error correction. (B) Illustration of correlation-based read identification. (C) Illustration of alignment-based read identification and read polishing.

First, we harnessed SCC correlation to enable rapid read identification (Fig. 3B). This approach allows the filtering of reads that are either without indels or with a few indels near the two ends (figs. S28 and S29). Following primer identification and demapping, the corrupted LCRC bit vector was correlated with the known SCCs. Only a small number of sliding correlations were required over the length of each SCC, rather than across the entire LCRC. The total number of correlation computations for a single read is p1 + p2 + ⋯ + pn, substantially alleviating the computational burden (table S9). Next, the Chinese remainder theorem (CRT) was applied to solve for the starting site (i.e., index) of the read. To mitigate misidentification, we incorporated an index double-check mechanism using a threshold check of the correlation with the associated segment of the LCRC. A correlation threshold (e.g., Thr = 0.8) was set to confirm the correct positioning of the read.
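Once the phase of each SCC has been found by sliding correlation, the CRT step reduces to standard modular arithmetic. A minimal sketch, assuming the phases are simply the starting index modulo each SCC length:

```python
from math import prod

def crt(phases, lengths):
    """Recover the unique starting index modulo prod(lengths) from the
    per-SCC phases returned by correlation (Chinese remainder theorem)."""
    P = prod(lengths)
    x = 0
    for r, p in zip(phases, lengths):
        q = P // p
        x += r * q * pow(q, -1, p)  # pow(q, -1, p): modular inverse of q mod p
    return x % P

# A read starting at index 100 of a 231-bit LCRC (SCC lengths 3, 7, 11)
# has phases 100 % 3 = 1, 100 % 7 = 2, and 100 % 11 = 1.
assert crt([1, 2, 1], [3, 7, 11]) == 100
```

Only p1 + p2 + ⋯ + pn sliding correlations are needed because each SCC is searched over its own short period; the CRT then pins down the index within the full product-length code.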

Then, to cope with indels efficiently, we employed an alignment-based method using the LCRC (Fig. 3C). Because the ideal LCRC is known to the decoder, it can serve as a reference for index alignment. On the basis of the alignment path, the read position within the LCRC was determined, while indel detection and correction within the read were performed simultaneously. Owing to its relatively high computational cost (table S9), this approach was applied only to reads that failed the correlation threshold check. Furthermore, as read alignment is a standard routine in bioinformatics, substantial speed improvements can be achieved using existing fast read alignment algorithms, e.g., the Burrows-Wheeler Alignment tool (BWA) (41).
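A brute-force sketch of this idea, with plain Levenshtein alignment standing in for a production aligner such as BWA (the reference bits, window size, and bit alphabet are illustrative, not taken from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

def locate(read_bits: str, lcrc: str, window: int) -> int:
    """Return the LCRC offset whose window minimizes the edit distance to
    the demapped read; the alignment path at that offset then pinpoints
    indel positions for polishing. Brute force for clarity only."""
    return min(range(len(lcrc) - window + 1),
               key=lambda s: edit_distance(read_bits, lcrc[s:s + window]))

# A clean 8-bit fraction taken from offset 5 of a toy 24-bit reference.
reference = "101100111000101101001110"
```

In practice the search is banded around candidate offsets, and the backtraced alignment path (not shown) marks which read positions carry insertions or deletions.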

Last, user data were fully recovered using consensus and ECCs. On the basis of the identified indices, the data sequence was assembled bit by bit via majority voting. Errors and erasures may persist in the consensus. Therefore, ECCs were used. A key consideration for ECCs is their error-erasure-correction capability. In our verification, efficient low-density parity-check (LDPC) codes and product codes (42) were used. In practice, ECCs with different code rates can be readily adopted to match error conditions and ensure reliable recovery (fig. S30).
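Bit-wise majority voting with erasure marking can be sketched as follows (the '?' erasure symbol and the tie-handling convention are our assumptions; the authors' consensus may differ):

```python
def consensus(columns):
    """Bit-wise majority vote across identified reads. Each column is the
    list of 0/1 votes at one bit position; positions with no coverage or a
    tied vote are marked as erasures ('?') for the downstream ECC decoder."""
    out = []
    for col in columns:
        ones = sum(col)
        if not col or 2 * ones == len(col):
            out.append("?")  # erasure: no coverage or tie
        else:
            out.append("1" if 2 * ones > len(col) else "0")
    return "".join(out)
```

An errors-and-erasures decoder can then exploit the fact that an erasure costs roughly half an error, which is why the text counts equivalent bit errors as (#error + #erasure/2).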

Correlation-based readout enables rapid recovery from low-error-rate reads

In DNA data storage, traditional recovery pipelines start with clustering or assembling a large number of noisy reads, which imposes substantial computational challenges. Our scheme based on SCC correlation enables rapid identification of individual reads. An example of the correlation-based recovery pipeline is provided in fig. S26. This example uses a 20-nt read containing a substitution error. It illustrates the index identification via SCC correlation under low-error scenarios (details in table S10). Both in vitro experiments and large-scale simulations were performed to verify the low-coverage recovery under typical error profiles of next-generation sequencing (NGS) (table S11).

First, in vitro experiments verified rapid low-coverage recovery from DNA pools using SCC correlation. Three image datasets (191.1 KB, 4.72 MB, and 12.87 MB) were encoded into 11,745, 299,700, and 959,850 oligos of 200 nt, respectively. These oligos were synthesized into four DNA pools using a high-fidelity or a low-cost synthesis protocol: one small-scale high-fidelity DNA pool (HFS-Pool-11.7K) and three low-cost DNA pools (LCS-Pool-11.7K, LCS-Pool-300K, and LCS-Pool-960K). Using the correlation-based method, error-free data recovery was achieved at low coverage across all pools (Fig. 4A and table S12). Specifically, 4× and 6× coverages were required for HFS-Pool-11.7K and LCS-Pool-11.7K, respectively; 6.6× coverage for LCS-Pool-300K; and 4.7× coverage for LCS-Pool-960K (Fig. 4A and figs. S31 to S35). The raw NGS reads exhibited low error rates of ~0.18 to 1.34%. In this low-indel regime, most reads contained either no errors or only a few errors (figs. S36 to S39); error occurrence positions within a read showed relatively uniform frequency with low probability (figs. S40 to S43). Therefore, a high proportion of reads were identified via correlation (fig. S44 and table S13). For example, up to 96% of reads were identified in NGS sequencing of HFS-Pool-11.7K. The correlation-based method maintained accurate index identification with a correlation threshold of 0.8 (figs. S45 to S49). In addition, with flexible ECC configurations at different code rates, even when the equivalent bit error rate after consensus reached ~6%, it remained within the error-resilience capability of the ECCs (figs. S50 to S53). The equivalent bit error number was computed as (#error + #erasure/2). Moreover, experimental results confirmed that Thr = 0.8 provided optimal recovery performance (fig. S54).

Fig. 4. Experimental and simulation verification at NGS error rates.


(A) Error-free codeword recovery ratio of oligo pools using different indexing schemes. For HFS-Pool-11.7K, R = 0.42. For LCS-Pool-300K, R = 0.41. For LCS-Pool-960K, R = 0.35. For Sim-Pool1 to Sim-Pool5, R = 0.41. In silico simulations were performed using the realistic error model, with the sequencing error rate set to 0.005 (fig. S55). (B) Impact of different payload lengths on the index identification ratio, with an overall error rate Perr = 0.005 and Pi: Pd: Ps = 1:4:10. Each marker represents an LCRC construction with a specific average SCC length. (C) Index identification ratio of different samples, which is related to payload length and the error rate. HFS, high-fidelity synthesis; LCS, low-cost synthesis.

Then, we proved the scalability of correlation-based readout for PB-scale data storage using large-scale datasets. We adopted a realistic DNA storage channel model that incorporates both sequence bias and errors arising from synthesis, polymerase chain reaction (PCR), and sequencing (fig. S55 and text S4). We used the image dataset DOTA (43) and product codes for data encoding, with a logical density of 0.82 bits per nucleotide (bits/nt). Indexing schemes on a scale from GB to PB (Sim-Pool1 to Sim-Pool5) were tested. Error-free recovery of GB-scale data was attained at low coverages of 4.0× to 5.6× using five SCCs (n = 5) (Fig. 4A). The error and erasure rates decreased with increasing sequencing coverage (fig. S56, A and B). For large-scale data, increasing the SCC length may lead to a modest decline in identification performance (fig. S56C). This decline was attributed to the partial correlation property exploited by the proposed correlation scheme (44). This limitation can be mitigated by increasing the payload length (fig. S56D). Moreover, reducing the number of SCCs remains feasible for maintaining correlation-based identification accuracy when the SCCs are relatively long. For example, error-free recovery at 3.6× coverage was achieved for MB- to GB-scale data with n = 3 (fig. S57). Furthermore, the recovery performance across different pool sizes indicated that the required coverage increases with pool size (fig. S58).

Last, simulations and real experiments verified that the correlation-based method scales well across various payload lengths. Simulations showed that the index identification ratio improved with increasing payload length (Fig. 4, B and C). For data retrieval from NGS reads of HFS-Pool-11.7K, a higher index identification ratio was achieved at a payload length of L = 160 than at L = 130, resulting in reduced sequencing coverage requirements (fig. S59).

Alignment-based readout adapts to severe insertions/deletions

The correlation-based method enables rapid readout with NGS. However, it is not robust under severe insertion/deletion errors, leading to a low read usage ratio and high sequencing coverage. To address this limitation, we used an enhanced alignment-based approach that leverages the readout-aware LCRC. This method retrieves the accompanying index by aligning the read to the reference and enables the correction of indels. An example demonstrating the alignment-based recovery pipeline is provided: a 20-nt read containing two indels is identified and polished via alignment to the LCRC (fig. S27); specific data are detailed in table S10. To assess performance under severe indels, the four synthesis pools were sequenced on a MinION or PromethION 2 Solo nanopore sequencer with R10.4.1 flow cells. Three different base-calling models were used to generate datasets with different error profiles (table S11).

With the real sequencing data, our proposed readout strategy accommodates a wide range of error rates (Fig. 5A). With LCRC alignment, the read usage ratio increased significantly, approaching nearly 100%, even under a high error rate (payload region) of ~7.9% (fig. S44 and table S13). The correlation-based readout is well-suited for low-indel conditions (e.g., Illumina sequencing). The alignment-based readout is more cost-effective under severe indel conditions. Using this approach, error-free data recovery from all four pools was achieved at <7× coverage across raw error rates of 3.68 to 9.3% (Fig. 5B and figs. S31 to S33). For the large-scale pool LCS-Pool-960K, all six original images and one text were fully recovered at a coverage of 3.2× with a raw error rate of 4.59% (Fig. 5B). In contrast, the correlation-based readout required a coverage of 6.4×. Even at a high raw error rate of 9.3%, error-free recovery was achieved at 4.4× coverage using the alignment-based method. Extensive in silico simulations indicated that our scheme tolerates high indel rates of up to 13% at 5× coverage with an overall code rate of R = 0.33 (fig. S60). If the code rate was increased to 0.47, then it remained resilient to indel rates as high as 5%. Our method achieved enhanced error correction capability relative to HEDGES while maintaining a similar logical density (table S4).

Fig. 5. Experimental verification at high indel rates and real-time readout demonstration.


(A) Typical error rates for different real examples. SUP, super-accuracy base calling; HAC, high-accuracy base calling; FAST, fast base calling. (B) Comparison of error-free recovery ratio (independent trials using 70× sequencing data). (C) Comparison of data readout time with the state-of-the-art schemes (using 16 CPU threads). A uniform coverage of 5× was applied, with L = 260 and Perr = 0.005. SCCs with lengths of 67, 71, 79, 83, and 103 were selected. (D) Illustration of real-time readout of 12.87 MB of image files from a 960K-oligo pool using the ONT nanopore sequencer (PromethION 2 Solo) (movie S2). Each blue point in the scatter plot (available payloads) represents an available consensus sequence. Image credit: Unsplash, licensed under the Unsplash License (https://unsplash.com/license; free for use).

Complexity comparison with state-of-the-art schemes

The conventional recovery process relies on clustering, multiple sequence alignment (MSA), and consensus (6, 27). It is difficult to scale up to massive DNA data storage due to high computational costs. During these steps, reads are deeply coupled, and thus the computational complexity increases sharply. Our proposed scheme decouples the relations between different reads and identifies each read individually, offering an efficient solution to the computational challenges.

First, we compared the computation time in typical NGS scenarios, where the indel rate is relatively low. Our correlation-based scheme was compared with the state-of-the-art scheme based on clustering and MSA (text S5) (28, 45). The data reconstruction time was reduced by about two orders of magnitude (Fig. 5C). Large-scale data tests revealed that the run time for correlation-based data reconstruction increased nearly linearly with the increasing data volume (figs. S61A and S62). We also observed that the data reconstruction time increased slowly as the length of the LCRC increased (fig. S61B), underscoring the scalability of our solution for large-scale data storage.

Then, we evaluated the computation time of the alignment-based method. In scenarios with modest indel rates, HEDGES is an efficient scheme because it processes reads individually. Our alignment-based method attained comparable computational complexity (Fig. 5C). As the indel rate increased, the processing time rose accordingly (fig. S63 and tables S14 to S16). This increase resulted from a larger fraction of reads being triaged into the high-complexity alignment pipeline.

Next, we analyzed the ratio of sequencing reads triaged into the alignment pipeline (table S13). These reads failed the index double-check during correlation. The ratio was related to error rates and error profiles of the embedded LCRC within the reads (figs. S64 to S67 and text S5). If the error rate was high (e.g., nanopore sequencing with the fast base-calling model), then only very small fractions of reads (as low as ~7%) found the correct positions via correlation and passed the threshold check. In this scenario, readout with only alignment processes would spare the correlation operations and obtain better time efficiency at low coverage. The triage ratio was also related to the correlation threshold (fig. S64). With a higher threshold, more reads were triaged to the alignment pipeline, thereby increasing the overall computation workload.

Last, to reduce memory usage during alignment, only the small set of SCCs composing the LCRC, together with the initial position of each segment, is stored. The LCRC can then be generated on the fly and loaded into memory segment by segment when large data volumes are stored under this scheme (fig. S68). This flexible implementation architecture reduces the required hardware resources, especially memory.
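Because every LCRC bit depends only on the stored SCCs, any segment can be regenerated on demand rather than held in memory. Assuming a bit-wise majority-logic combination (a toy stand-in for the redesigned logic of text S2), a sketch is:

```python
def lcrc_segment(sccs, start, length):
    """Regenerate LCRC bits [start, start + length) directly from the
    stored SCCs (bit-wise majority logic assumed), so the full
    product-length code never needs to reside in memory."""
    n = len(sccs)
    return [1 if 2 * sum(s[k % len(s)] for s in sccs) > n else 0
            for k in range(start, start + length)]
```

A decoder can thus stream reference segments lazily; the memory footprint is the sum of the SCC lengths plus one working segment, instead of the full product-length code.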

Simultaneous read-by-read decoding and nanopore sequencing

Most existing readout strategies require the completion of the entire sequencing run, leading to low readout rates and large latency; a full run typically takes several to tens of hours. Motivated by real-time molecule-wise sequencing with nanopore platforms (46–48), we highlighted the implementation of simultaneous decoding and sequencing with the proposed accompanying indexing and recovery framework. Once a sequencing read became available after the associated molecule passed through the nanopore and was base-called, our progressive readout pipeline could incorporate it. This read-by-read decoding strategy facilitates rapid data readout.

First, we performed real-time readout verification on three DNA pools (HFS-Pool-11.7K, LCS-Pool-300K, and LCS-Pool-960K) across KB to MB scales using nanopore sequencers (MinION or PromethION 2 Solo). We achieved full recovery in ~20 min for the 12.87 MB of data, ~6 min for the 4.72 MB of data, and ~7 min for the 191.1 KB of data (Fig. 5D, figs. S69 to S72, and movies S2 to S8). Our real-time decoding pipeline started with read-by-read identification (texts S6 to S8). Each time a certain number of reads had been collected, consensus calling over the identified indices was applied, followed by error correction. To avoid decoding failures due to high erasure rates, a base available ratio threshold (e.g., 90%) was set to trigger error correction attempts. As the sequencing reads streamed out continuously, the original images could be presented progressively. For LCS-Pool-960K, after sequencing had run for about 10 min, ~79% of the coding bases had been accumulated (Fig. 5D). After another 10 min, the six images were read out and decoded without any error.
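The read-by-read pipeline can be sketched as a streaming accumulator that fires a decode attempt once the base-available ratio crosses the threshold. The class and method names here are illustrative, and the "available" criterion (an unambiguous majority vote) is our simplification:

```python
class StreamingDecoder:
    """Accumulate votes read by read; attempt ECC decoding once the
    fraction of positions with an unambiguous consensus bit reaches a
    threshold (0.90, as in the text)."""

    def __init__(self, total_bits: int, threshold: float = 0.90):
        self.votes = [[0, 0] for _ in range(total_bits)]  # [zeros, ones]
        self.threshold = threshold

    def add_read(self, index: int, bits):
        """Add an identified read's bits starting at its recovered index."""
        for i, b in enumerate(bits):
            self.votes[(index + i) % len(self.votes)][b] += 1

    def available_ratio(self) -> float:
        """Fraction of positions whose majority vote is unambiguous."""
        return sum(1 for z, o in self.votes if z != o) / len(self.votes)

    def should_attempt_decode(self) -> bool:
        return self.available_ratio() >= self.threshold
```

Because each read is identified independently (by correlation or alignment), reads can be folded in as fast as the sequencer and base caller emit them.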

Then, we verified real-time readout under high-error scenarios using the fast base-calling model, which resulted in raw error rates of ~8.47 to 8.9%. For all three DNA pools, the minimum coverage required for successful data retrieval remained below 7× (figs. S71 to S74 and table S17). These results demonstrate that low coverage requirements and strong error-correction capability are maintained in real-time readout scenarios.

Last, readout experiments with different sequencers showed that the effective real-time readout rate is mainly limited by sequencing throughput. The readout rate increased approximately linearly with nanopore sequencing throughput (fig. S62, A and B, and tables S17 and S18). Considering that real-time sequencing was much slower than the recovery pipeline, we conjectured that the progressive recovery pipeline could sustain the readout rate as data volumes scaled to larger files with increased sequencing throughput. We further assessed the offline readout rate. For nanopore data collected with the super-accuracy and high-accuracy base-calling models, we used the correlation-based readout method. The readout rate was approximately independent of data scale across three different data volumes (fig. S62, C and D, and table S16). Notably, our scheme achieved an offline readout speed of ~0.84 MB/s compared with 0.18 MB/s for DNAformer (29).

Full-length indexing identifies reads with arbitrary starting sites and lengths

The proposed scheme supports efficient identification and correction of noisy sequencing reads with arbitrary starting sites. The results above have confirmed its feasibility for indexing oligos of uniform length. Here, we verified that the scheme also adapts well to shotgun-sequenced readout of large DNA fragments and to corrupted DNA pools due to DNA decay.

First, full-length indexing enabled assembly-free data recovery from large DNA fragments (Fig. 6A and fig. S75). The scheme was evaluated through in vivo and in silico experiments (table S19). For in vivo verification, a 22,680–base pair (bp) DNA fragment was constructed, accommodating a 945-byte text file encoded with the LCRC. In this example, we applied a rate-1/3 nonbinary LDPC code (49), offering strong resistance to high error rates (figs. S4 and S12). The recovery scheme reconstructed the original sequence via a one-step consensus, avoiding the search for overlapping regions between reads. Thus, we achieved perfect data recovery from shotgun sequencing reads of the sample (HFS-L-DNA) at a coverage of 1.5× and a raw error rate of ~0.36% (Fig. 6B and fig. S76, A and B). Notably, this scheme substantially reduced the required sequencing coverage compared to graph-based assembly approaches, e.g., Velvet (text S9) (50). It should be emphasized that interfering reads from the host genome and vector sequence were effectively excluded with the index double-check (figs. S76, C and D, and S77). Furthermore, we simulated the NGS-based parallel readout of 97,000 pieces of 33-kb DNA fragments (Sim-L-DNA) using ART (51). This simulation aimed to evaluate large-scale data storage performance using ECCs with high code rates. Product codes with a code rate of 0.82 were used. Reliable data recovery via correlation was achieved at a coverage of 3.6× (fig. S78).

Fig. 6. Experimental verification on large DNA fragments and degraded oligo pools.

Fig. 6.

(A) Sparse distribution of valid reads for data recovery of HFS-L-DNA according to LCRC. (B) Comparison of error-free recovery ratio using shotgun sequencing (1000 independent trials, 150 nt, single-ended). (C) Illustration of accelerated aging experiments. The read usage ratio was compared to the traditional index-payload scheme (10-nt index next to the forward primer). (D) Length distribution of valid reads (bar) and the cumulative distribution (line). (E) Comparison of error-free recovery ratio (1000 independent trials) of LCS-Pool-11.7K and LCS-D-Pool-11.7K.

Then, full-length indexing effectively tolerated partial loss of one or both ends of the DNA strands. We degraded LCS-Pool-11.7K via incubation at 70°C, severely damaging the integrity of the 200-nt DNA strands in the degraded pool (LCS-D-Pool-11.7K) (Fig. 6C). Nearly 87% of the reads featured strand breaks, reducing the average aligned read length (106.5 nt in total length and 92.8 nt in payload length) to about half of the original strand length (fig. S79). For the index-payload scheme, if a consecutive 10-nt region adjacent to the forward primer was used as the index, only half of the reads retained the full index (Fig. 6C). In contrast, our scheme retained 97.5% of the reads, among which ~40% were identified via correlation (Fig. 6D). As a result, we achieved error-free recovery from LCS-D-Pool-11.7K at coverages of 4× and 8.2× for the alignment-based and correlation-based readout methods, respectively (Fig. 6E and fig. S80). Compared to LCS-Pool-11.7K (control group), the increase in sequencing coverage remained within a reasonable range.

Furthermore, accelerated aging tests on large DNA fragments demonstrated that the full-length accompanying indexing accommodates segments with arbitrary starting sites and shortened lengths. The ~33-kb DNA fragment (HFS-L-DNA) was incubated at 85°C for 1 to 5 hours. All degraded samples (HFS-D-L-DNA-1h to HFS-D-L-DNA-5h) underwent fragmentation to different extents, yielding fragments with a broad length distribution (figs. S81 and S82). Using the correlation-based method, error-free recovery was achieved at 2.5× coverage when the average fragment length decreased to 92.5 nt (HFS-D-L-DNA-3h). As the average fragment length further declined to 58.3 nt (HFS-D-L-DNA-5h), the required coverage increased to 5.6× due to the reduced read usage ratio (fig. S83). The alignment-based method maintained 92.17% valid reads and achieved error-free recovery at 2.5× coverage, with no significant increase in sequencing coverage (table S20).

DISCUSSION

We have developed an accompanying indexing and progressive recovery framework for massive DNA data storage. It delivers user data using numerous short fractions of a specialized LCRC. To facilitate data retrieval across various error conditions, a progressive error correction strategy involving rapid correlation, robust alignment, and ECCs was used. Our accompanying indexing solution offers advantages in terms of data capacity, error tolerance, and computational efficiency.

First, our scheme enables scalable and efficient indexing of large-scale data, paving the way for a seamless extension to practical PB-scale DNA storage systems. From a computational perspective, existing DNA storage schemes may suffer from high complexity due to massive clustering or from high redundancy due to index protection, and thus usually cannot adapt well to massive data volumes. In contrast, our scheme achieves low complexity via the easy-to-use correlation, and the indices are flexible for different data scales.

Then, our scheme adapts well to harsh error conditions, especially severe indel errors and DNA decay. On the one hand, this suggests a promising opportunity to exploit cost-efficient synthesis and sequencing technologies for massive DNA data storage systems. In practical DNA storage applications, the high writing cost remains a critical challenge. Low-cost synthesis technologies, e.g., photolithographic or electrochemical synthesis (27), offer more affordable solutions for data writing despite high indel rates. In our verification, we achieved error-free recovery from low-cost synthesis DNA pools using both NGS and nanopore sequencing, indicating the feasibility of our scheme for low-cost synthesis. On the other hand, by accompanying the index through all data bits, our scheme offers a decay-tolerant solution for long-term data storage. Although encapsulation of DNA can enhance DNA stability, DNA degradation is difficult to completely avoid during the long-term storage of data (4, 20). Using the full-length accompanying indexing, our scheme demonstrates strong resilience to DNA decay and thus provides a promising foundation for long-term DNA data storage.

Next, read-by-read decoding is anticipated to exploit the potential of real-time single-molecule sequencing for rapid readout. In our verification, we achieved a real-time readout of the stored data (12.87 MB) in ~20 min using a nanopore device. Compared to the previously reported rapid readout scheme using nanopore sequencing (46), our solution is superior in both readout time and data volume. However, protein nanopores currently cannot scale to extensive oligo pools. In the future, advancements in solid-state nanopore sensors hold promise for high-throughput sequencing. When combined with read-by-read decoding, this approach will improve the data readout rate.

The LCRC-based accompanying indexing scheme enables rapid and robust index identification. However, compared with the conventional “index + payload” structure, it sacrifices flexibility in selectively processing subsets of sequences. In index-payload designs, real-time selective sequencing methods using nanopores (e.g., ReadUntil) can identify the short index regions within reads and filter out nontarget reads at an early stage (52), thus fully exploiting the sequencing pores and reducing sequencing costs. In contrast, the accompanying indexing cannot support flexible selection of individual strands.

Last, on the basis of the accompanying property, the theoretical upper bound of the logical density that our scheme can attain is log₂K − 1 bits/nt, where K is the DNA alphabet size (for natural nucleotides, K = 4). Similar to HEDGES, the proposed scheme trades a relatively low logical density for read-by-read processing (table S5). Future efforts using a larger molecular alphabet are expected to mitigate this limitation. Several studies have reported expanded alphabets comprising natural and non-natural nucleotides, e.g., chemically modified nucleotides (53). The coding efficiency of the proposed scheme improves as the alphabet size increases; for instance, with an extended alphabet of eight letters, the theoretical logical density can reach up to 2 bits/nt. These expanded alphabets can be detected with commercially available nanopore sequencers. However, high error rates of non-natural nucleotides remain a major challenge (53). Given that our solution is effective in handling complex errors, it holds the potential to overcome this limitation and advance the application of expanded alphabets in DNA data storage.

MATERIALS AND METHODS

Construction of the specialized LCRC

The specialized LCRCs for large-scale DNA data storage were constructed as follows (fig. S15 and text S2).

Step 1: Determine the SCC number n and the average length p̄, where n is an odd integer and n > 1. The LCRC length P was determined on the basis of the data volume V, such that V ≤ P. Given that P = p₁p₂⋯pₙ ≈ p̄ⁿ, the SCC number n and the average length p̄ were accordingly determined.

Step 2: According to the expected average length p̄, select n periodic PN sequences with good balance from a predefined candidate set to serve as the SCCs. The balance property means that the numbers of 1’s and 0’s in the sequence are nearly equal.

Step 3: Periodically extend each SCC to obtain an extended sequence with a length equal to the product of the SCC periods.

Step 4: Use the strict majority combination logic to construct the LCRC. Let Ĉ₁, Ĉ₂, …, Ĉₙ be the ±1 sequences corresponding to the SCCs C₁, C₂, …, Cₙ, respectively. The LCRC represented by {−1, +1} was generated by the following combination logic

Ĉ = sign(Ĉ₁ + Ĉ₂ + ⋯ + Ĉₙ) (1)

The mapping between a binary sequence represented by {0, 1} and a ±1 sequence represented by {−1, +1} was defined as

f: {0, 1} → {−1, +1} (2)

In verification experiments, five distinct PN sequences with periods of 31, 35, 43, 47, and 59 were chosen as the SCCs. The resulting LCRC had a full length of 129,374,315. Similarly, PN sequences with periods of 43, 47, 59, 63, and 67 were used as the SCCs to construct an LCRC with a full length of 503,307,819.
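For illustration, the four construction steps can be sketched in pure Python as follows. The function name build_lcrc is ours, and the toy SCC periods (3, 5, and 7) are illustrative only, not the periods used in the experiments.

```python
from math import prod

def build_lcrc(sccs):
    """Combine binary PN component codes (SCCs) into one LCRC via strict majority.

    sccs: list of {0, 1} sequences with pairwise coprime periods; the number of
    SCCs must be odd. Returns the LCRC as a {-1, +1} list of length prod(periods).
    """
    P = prod(len(c) for c in sccs)  # full LCRC length (Step 1)
    lcrc = []
    for k in range(P):
        # Steps 2 to 4: map each SCC bit to +/-1 (Eq. 2), extend periodically,
        # and combine by strict majority (Eq. 1); odd n guarantees a nonzero sum
        s = sum(2 * c[k % len(c)] - 1 for c in sccs)
        lcrc.append(1 if s > 0 else -1)
    return lcrc

# Toy example with pairwise coprime periods 3, 5, and 7 (LCRC length 105)
lcrc = build_lcrc([[1, 0, 1], [1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 0, 1]])
```

With the five experimental periods (31, 35, 43, 47, and 59), the same product rule yields the stated full length of 129,374,315.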

Criteria for SCC selection

In constructing the LCRC, the n PN SCCs were selected according to the following criteria (see text S2 for details).

1) The SCC lengths were chosen to be as close as possible, to balance the overall LCRC length against computational complexity.

2) The number of SCCs should be kept as small as possible to maintain favorable correlation-based identification performance. Moreover, to avoid ties in majority voting that arise with an even number of SCCs, this study adopted an odd number of SCCs (n = 3, 5).

3) The SCCs should be well balanced, with nearly equal numbers of 1’s and 0’s in each sequence, to ensure robust correlation performance.

Typical PN sequences suitable as SCCs include m sequences, Legendre sequences of length 4t − 1 (t is a positive integer), and two-prime sequences. Recommended SCCs for different scales are provided in fig. S15B.
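As a concrete instance, a Legendre sequence of prime length p with p ≡ 3 (mod 4) (i.e., length 4t − 1) can be generated from the quadratic residues modulo p. The helper name legendre_seq and the s[0] = 0 convention are our illustrative choices; the balance criterion (3) holds because exactly (p − 1)/2 residues are nonzero quadratic residues.

```python
def legendre_seq(p):
    """Legendre sequence of prime length p (p % 4 == 3): s[k] = 1 iff k is a
    nonzero quadratic residue mod p; s[0] = 0 by one common convention."""
    qr = {(k * k) % p for k in range(1, p)}  # nonzero quadratic residues mod p
    return [1 if k in qr else 0 for k in range(p)]

# Period-31 example: 15 ones vs. 16 zeros, i.e., balanced to within one symbol
s31 = legendre_seq(31)
```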

DNA coding for data storage with small-scale oligo pools

The encoding process for data storage with oligo pools included three main steps (fig. S9).

Step 1: Data encoding using LDPC codes. The user data were first scrambled via modulo-two addition with a binary PN sequence. The resulting randomized binary data were then partitioned into uniform data blocks, each comprising 54,000 bits. Each block was encoded using a binary LDPC code to generate a 64,800-bit codeword, followed by a permutation process. The interleaved LDPC codewords were divided into data segments of equal length.

Step 2: Indexing and segmented mapping. The full length or a small fraction of the predefined LCRC was partitioned into fractions, each with a length equal to that of the data segments. After that, these fractions were combined with the data segments bit by bit to construct two-layer bit sequences (i.e., bit pairs) (40). In each sequence, the upper layer was the data segment D = (d1, d2, …, dL), and the lower layer was the LCRC segment (i.e., the index) C = (c1, c2, …, cL), where L denotes the payload length. The upper and lower bits at the same position formed a bit pair (dj, cj), 1 ≤ j ≤ L. The resulting two-layer bit sequences were transcoded into DNA payloads. Each bit pair (dj, cj) ∈ {00, 01, 10, 11} was mapped to a nucleotide in {A, T, G, C}.
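The two-layer combination in this step can be sketched as follows. The text fixes only the mapping between the sets {00, 01, 10, 11} and {A, T, G, C}; the exact pair-to-base assignment below is an assumption for illustration.

```python
# Illustrative assignment of bit pairs (d_j, c_j) to nucleotides; only the
# set-level mapping {00, 01, 10, 11} -> {A, T, G, C} is specified in the text.
PAIR_TO_NT = {(0, 0): 'A', (0, 1): 'T', (1, 0): 'G', (1, 1): 'C'}
NT_TO_PAIR = {nt: pair for pair, nt in PAIR_TO_NT.items()}

def to_payload(data_seg, lcrc_seg):
    """Combine a data segment (upper layer) and an LCRC segment (lower layer)
    bit by bit into a DNA payload of the same length L."""
    return ''.join(PAIR_TO_NT[(d, c)] for d, c in zip(data_seg, lcrc_seg))

def demap(payload):
    """Recover the two bit layers from a payload (used in read preprocessing)."""
    pairs = [NT_TO_PAIR[nt] for nt in payload]
    return [d for d, _ in pairs], [c for _, c in pairs]
```

A round trip through to_payload and demap returns the original two layers, which is the accompanying property exploited during decoding.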

Step 3: Appending primers. To facilitate PCR amplification, the constant paired primers (20 nt in length) were appended to both ends of the payloads. The full oligo sequences were ultimately obtained.

In the verification experiments, a small fraction (of length 1,879,200) of the predefined LCRC was used. Two JPEG image files (97.8 and 93.3 KB) were encoded into 11,745 oligos, each 200 nt in length. Each DNA strand comprised paired-end primers of 20 nt each and a 160-nt payload.

DNA coding for data storage with medium- and large-scale oligo pools

The encoding process for data storage with medium- or large-scale oligo pools included two main steps (figs. S10 and S11).

Step 1: Data encoding using product codes. The user data were scrambled and organized into a matrix with Mr rows and 54,000 columns, where each row contained 54,000 bits. Then, a product code composed of LDPC and Reed-Solomon (RS) component codes was applied for column and row encoding. For column encoding (outer encoding), m consecutive bits from each row were extracted to form an RS symbol in the Galois field GF(2m), yielding Mr symbols. Subsequently, RS encoding of 54,000/m RS codewords was performed. For row encoding (inner encoding), each row of 54,000 bits was encoded using an LDPC code. Next, a bit-level diagonal permutation process was performed on the encoded data matrix.

Step 2: Indexing and segmented mapping. The data matrix was decomposed into segments along the row dimension and combined with the predefined LCRC fractions to construct payloads. Following the same procedure described above, the resulting bit pairs were converted into DNA sequences, and paired-end primers were appended.

For the medium-scale DNA pool containing ~300K oligos, three JPEG images and one text file (totaling 4.72 MB) were encoded and distributed into 299,700 oligos, each 200 nt in length with a 160-nt payload. An LDPC-RS product code composed of 740 LDPC(64800, 54000) codes and 5400 shortened RS(740, 735) codes [shortened from RS(1023, 1018)] over GF(2¹⁰) was used.

For the large-scale DNA pool containing ~960K oligos, six JPEG images and one text file (totaling 12.87 MB) were encoded and distributed into 959,850 oligos, each 200 nt in length with a 160-nt payload. An LDPC-RS product code was used, consisting of 2370 LDPC(64800, 54000) codes and 4500 shortened RS(2370, 2000) codes [shortened from RS(4095, 3725)] over GF(212).

DNA coding for data storage with large DNA fragments

The encoding process for data storage with large DNA fragments was simplified to two steps (fig. S12).

Step 1: Data encoding using nonbinary LDPC codes. A nonbinary LDPC code defined on GF(26) (49) was used for data encoding. A 945-byte text file was converted to binary form, followed by encoding and permutation, resulting in a codeword of 22,680 bits.

Step 2: Indexing and mapping. A small fraction (of length 22,680) of the predefined LCRC was used. Because large DNA fragments were used as a whole for data storage, this LCRC segment was combined with the codeword bit by bit and then converted into a 22,680-bp DNA fragment.

Oligo pool synthesis

Medium- and large-scale oligo pools, LCS-Pool-300K and LCS-Pool-960K, were synthesized by Dynegene Technologies, containing 299,700 and 959,850 oligos of 200 nt, respectively (table S11). After elution, each pool yielded 3 μg of synthesized DNA.

In addition, two small-scale oligo pools, HFS-Pool-11.7K and LCS-Pool-11.7K, were synthesized by Twist Bioscience and Dynegene Technologies, respectively, each containing the same set of 11,745 oligos of 200 nt (table S11). After elution, the former yielded a total mass of 245 ng of synthesized DNA, while the latter yielded 4 μg.

Large DNA fragment construction and assembly

The 33-kb plasmid (HFS-L-DNA) was synthesized by Beijing Tsingke Biotech Co. Ltd., wherein the vector (pCC1413) was assembled with the encoded DNA fragment (table S19). First, the 22-kb encoded DNA fragment was divided into ~1.5-kb building blocks using the dynamic programming algorithm Spirillum 6.0. The DNA blocks were synthesized using polymerase cycling assembly and overlapping extension PCR techniques. Subsequently, Trelief SoSoo Cloning Kit Ver.2 (Tsingke, catalog no. TSV-S2) was used for homologous recombination to assemble the DNA blocks into the pCC1413 vector sequence. Correct DNA blocks were screened using colony PCR and Sanger sequencing. Then, PCR amplification of multiple DNA blocks and the vector sequence was performed using I-5 2 × High-Fidelity Master Mix (Tsingke, catalog no. TP001), followed by assembly and gel purification to obtain the final full-length DNA. A total of 100 fmol of DNA block and 50 fmol of linearized vector were taken for single-step assembly into a plasmid using yeast in vivo assembly. Each junction region was verified via colony PCR to identify target plasmid colonies. Last, the plasmids were isolated and purified via alkaline lysis and ethanol precipitation and electroporated into Escherichia coli EPI300.

DNA library preparation

Four synthesized DNA libraries were each diluted prior to PCR amplification. For the medium- and large-scale oligo pools (LCS-Pool-300K and LCS-Pool-960K), the dried DNA powders were dissolved in 40 μl of double-distilled water (ddH2O). A sixfold dilution was prepared by mixing 5 μl of the stock solution with 25 μl of ddH2O, yielding a final concentration of 12.5 ng/μl. HFS-Pool-11.7K was dissolved in ddH2O to a final concentration of 0.49 ng/μl, as quantified using the Qubit ssDNA Assay Kit (Thermo Fisher Scientific, catalog no. Q10212). For LCS-Pool-11.7K, a 20-fold dilution was prepared by mixing 5 μl of stock solution with 95 μl of ddH2O, yielding a final concentration of 1.4 ng/μl.

Then, samples from DNA libraries were independently PCR-amplified for 15 cycles using the KAPA HiFi HotStart PCR Kit (Roche, catalog no. KK2502) under identical conditions. The primers used for amplification were purchased from Beijing Tsingke Biotech Co. Ltd.

For the medium- and large-scale oligo pools (LCS-Pool-300K and LCS-Pool-960K), PCR was performed in 50-μl reactions containing 1.5 μl of KAPA dNTP Mix (10 mM each), 3 μl of 10 μM forward primer, 3 μl of 10 μM reverse primer, 10 μl of 5× KAPA HiFi Fidelity Buffer, 1.5 μl of KAPA HiFi HotStart DNA Polymerase (1 U/μl), and diluted DNA template (8 μl for LCS-Pool-960K and 2.6 μl for LCS-Pool-300K), with ddH2O added to 50 μl. Thermocycling conditions were 95°C for 2 min; 15 cycles of 98°C for 20 s, 65°C for 20 s, and 72°C for 30 s; and a final extension at 72°C for 3 min. PCR products were purified using AMPure XP Beads at a bead-to-DNA ratio of 1.8:1 and eluted in 40 μl of ddH2O, yielding 24 ng/μl (LCS-Pool-300K) and 25.2 ng/μl (LCS-Pool-960K) as measured by the Qubit 1× dsDNA HS Assay Kit (Thermo Fisher Scientific, catalog no. Q33231). The primer sequences were as follows: forward, 5′-AATCATGGCCTTCAAACCGT-3′; reverse, 5′-AACAAGACTTTCGGAGCGTT-3′.

For the small-scale DNA libraries (HFS-Pool-11.7K and LCS-Pool-11.7K), PCR was performed with the same kit and cycling protocol in 25-μl reactions containing 0.75 μl of KAPA dNTP Mix (10 mM each), 2.5 μl of 10 μM forward primer, 2.5 μl of 10 μM reverse primer, 5 μl of 5× KAPA HiFi Fidelity Buffer, and 0.75 μl of KAPA HiFi HotStart DNA Polymerase (1 U/μl). Each reaction used 3 μl of diluted HFS-Pool-11.7 K sample (0.49 ng/μl) or 10 μl of diluted LCS-Pool-11.7K sample (1.4 ng/μl) as template, with ddH2O added to 25 μl (10.5 μl for HFS-Pool-11.7K and 3.5 μl for LCS-Pool-11.7K). PCR products were purified with AMPure XP Beads (1.8:1 bead-to-DNA ratio) and eluted in 20 μl of ultrapure water, yielding 30.8 ng/μl (HFS-Pool-11.7K) and 41.4 ng/μl (LCS-Pool-11.7K) as determined by the same Qubit assay. The primer sequences were as follows: forward, 5′-ATAATTGGCTCCTGCTTGCA-3′; reverse, 5′-AATGTAGGCGGAAAGTGCAA-3′.

Illumina sequencing of oligo pools

After PCR amplification, the sequencing libraries of HFS-Pool-11.7K, LCS-Pool-11.7K, LCS-Pool-300K, and LCS-Pool-960K were constructed from the respective purified PCR products using the standard Illumina protocol. The quality of the final libraries was assessed on the Agilent 2100 system (Agilent, USA) and further quantified using quantitative PCR to a final concentration of 1.5 nM. Subsequently, the libraries were sequenced to produce 150-nt paired-end (PE150) raw reads.

Nanopore sequencing of oligo pools

Medium- and large-scale pools (LCS-Pool-300K and LCS-Pool-960K) were sequenced using high-throughput PromethION R10.4.1 flow cells (Oxford Nanopore Technologies). The small-scale pools (HFS-Pool-11.7K and LCS-Pool-11.7K) were sequenced using MinION R10.4.1 flow cells (Oxford Nanopore Technologies). First, the sequencing libraries were prepared using the Ligation Sequencing Kit V14 (SQK-LSK114) according to the manufacturer’s instructions. Then, each DNA library was loaded into a separate flow cell as soon as it was ready and sequenced. All runs were base-called using the base caller Dorado (v7.6.8) integrated within MinKNOW (v24.11.10) to generate FASTQ files. The Q-score filtering was set at 10 for super-accuracy base calling (SUP), at 9 for high-accuracy base calling (HAC), and at 8 for fast base calling (FAST).

Shotgun sequencing of large DNA

The enriched library of HFS-L-DNA was dissolved in ddH2O to a final concentration of 100 ng/μl. The library was then randomly fragmented to the desired size range using a Covaris ultrasonicator. The fragmented DNA underwent end repair to generate blunt ends, followed by 3′ A-tailing to enable adapter ligation. Sequencing adapters compatible with the Illumina platform were then ligated. The ligated products were subsequently purified to remove unbound adapters and other contaminants. Fragment size selection was performed to enrich the desired insert size, followed by PCR amplification. The final sequencing library was quantified and quality-assessed. The 150-nt paired-end sequencing run was performed on the Illumina platform to collect reads.

Accelerated aging test of oligo pools

For the accelerated aging experiment, 300 ng of the purified LCS-Pool-11.7K PCR product was diluted with ddH2O to a final volume of ~75 μl. The diluted solution was aliquoted into three 200-μl PCR tubes, each containing 25 μl. The samples were then incubated at 70°C for 12 hours in the absence of light and stored at −20°C until sequencing. The DNA concentration before and after aging was measured using the Qubit 1X dsDNA Assay Kit (Thermo Fisher Scientific, catalog no. Q33231). The integrity of the sample was assessed using agarose gel electrophoresis. Quantification revealed an initial DNA mass of 100 ng, which decreased to 46.25 ng after accelerated aging. The final degraded DNA library (LCS-D-Pool-11.7K) was sequenced on the Illumina platform, yielding PE150 reads.

Accelerated aging experiments of large DNA

For the accelerated aging experiment of large DNA (HFS-L-DNA), the sample solution was diluted to a concentration of 100 ng/μl. A total volume of 250 μl was aliquoted into ten 200-μl PCR tubes (25 μl each). The samples were incubated at 85°C for 1, 2, 3, 4, and 5 hours, respectively, in the absence of light. Following heat exposure, samples were immediately cooled and stored at −20°C before downstream processing. DNA concentrations before and after aging were quantified using the Qubit 1X dsDNA Assay Kit (Thermo Fisher Scientific, catalog no. Q33231). All samples exhibited reduced DNA content after thermal treatment, decreasing from the initial 100 ng/μl to ~12.8 to 15.7 ng/μl, depending on the incubation duration. The integrity of thermally aged large DNA was examined using agarose gel electrophoresis to confirm degradation. The final degraded samples (HFS-D-L-DNA-1h to HFS-D-L-DNA-5h) underwent Illumina PE150 sequencing.

Sequencing read preprocessing

Before read identification, sequencing reads underwent preprocessing.

Step 1: Primer identification. For each read, the flanking primers were identified via pairwise alignment using Edlib (54). The boundaries of the data-carrying region were then determined according to the alignment path with a minimum edit distance. A lenient primer-trimming strategy was used to maximize read retention. The primers were then trimmed to extract the DNA payload. Notably, this step was omitted for shotgun-based recovery of large DNA fragments.

Step 2: Demapping. Each DNA payload sequence was demapped into a two-layer bit sequence. According to the accompanying property, a noisy LCRC segment and a noisy payload (both are in the form of bit vectors) of the same length were extracted.
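The primer identification in Step 1 uses Edlib in the actual pipeline; as a self-contained sketch of the same idea, the hypothetical helper find_primer_end below locates a primer boundary via semi-global edit distance (gaps before and after the primer's match region in the read are free), returning the minimum edit distance and the payload start offset.

```python
def find_primer_end(read, primer):
    """Locate the end of a primer within a read by semi-global edit distance."""
    n = len(read)
    prev = [0] * (n + 1)  # row 0 all zeros: the primer may start anywhere
    for i in range(1, len(primer) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if primer[i - 1] == read[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match / substitution
                         prev[j] + 1,         # deletion in the read
                         cur[j - 1] + 1)      # insertion in the read
        prev = cur
    end = min(range(n + 1), key=lambda j: prev[j])  # best primer end position
    return prev[end], end  # (edit distance, offset where the payload begins)
```

For example, against the read "AXGTCCCC" the primer "ACGT" is found with one mismatch ending at offset 4, so the payload would be trimmed from that position.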

Correlation-based read identification

Following the preprocessing step, read identification was performed. For rapid correlation-based read identification, the process involved three steps.

Step 1: SCC correlation. For each read, the cross-correlation function between the associated noisy LCRC segment and each ideal SCC was first computed using a sliding operation. The cross-correlation function is denoted as

R_{Ĉ,Ĉᵢ}(τ) = ∑_{k=0}^{L−1} Ĉ_k Ĉ_{i,k+τ} (3)

where L is the length of the noisy LCRC segment Ĉ (equal to the payload length), τ is the phase offset of SCC Ĉᵢ with 0 ≤ τ ≤ pᵢ − 1, and the SCC indices are taken modulo pᵢ by periodic extension. Then, a peak search algorithm was used to locate correlation peaks, yielding the estimated phase τᵢ of each SCC

τᵢ = argmax_τ R_{Ĉ,Ĉᵢ}(τ) (4)

Step 2: Solve the starting site via the CRT. According to the CRT from number theory, the estimated phases of the SCCs satisfy the following congruence relation

τ₁ ≡ φ mod p₁
τ₂ ≡ φ mod p₂
⋮
τₙ ≡ φ mod pₙ (5)

Note that gcd(pᵢ, pⱼ) = 1 for i ≠ j, such that

P = p₁p₂⋯pₙ = p₁M₁ = ⋯ = pₙMₙ (6)

Then, the unique starting site of the noisy LCRC segment (i.e., index) relative to the ideal LCRC was obtained

φ = (∑_{i=1}^{n} τᵢ Mᵢ Mᵢ′) mod P (7)

where Mᵢ′ is the modular multiplicative inverse of Mᵢ, i.e., Mᵢ′Mᵢ mod pᵢ = 1. On this basis, the starting site of the associated payload was determined.

Step 3: Index double-check. On the basis of the inferred starting site, the noisy LCRC segment was correlated to the ideal LCRC segment with the same length once. If the final correlation value reached the predefined correlation threshold, then the associated payload was chosen for the subsequent consensus step.
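Steps 1 and 2 can be sketched in pure Python as follows. The names estimate_phase and crt_start_site are ours; the sketch assumes clean ±1 layers and omits the noise-robust peak search and the correlation-threshold double-check of Step 3.

```python
from math import prod

def estimate_phase(segment, scc):
    """Step 1 (Eqs. 3 and 4): cross-correlate a +/-1 LCRC segment with one
    periodically extended +/-1 SCC and return the phase of the highest peak."""
    p = len(scc)
    def corr(tau):
        return sum(segment[k] * scc[(k + tau) % p] for k in range(len(segment)))
    return max(range(p), key=corr)

def crt_start_site(phases, periods):
    """Step 2 (Eqs. 5 to 7): recombine the per-SCC phases into the unique
    starting site modulo P via the Chinese remainder theorem."""
    P = prod(periods)
    phi = 0
    for tau_i, p_i in zip(phases, periods):
        M_i = P // p_i
        # pow(M_i, -1, p_i) computes the modular inverse (Python 3.8+)
        phi += tau_i * M_i * pow(M_i, -1, p_i)
    return phi % P
```

Because the periods are pairwise coprime, crt_start_site returns the only starting site in [0, P) consistent with all estimated phases.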

Alignment-based read identification

Reads that failed the threshold check in the correlation-based identification were subjected to alignment-based identification. This approach used a two-step process for read identification and indel correction.

Step 1: LCRC alignment. The noisy LCRC segment was aligned to the ideal LCRC using BWA (41). The alignment path and the starting site of the noisy LCRC segment relative to the ideal LCRC were determined.

Step 2: Indel detection and correction. On the basis of the alignment path, indels within the noisy LCRC segment were identified. Owing to the accompanying property of the index and payload, these indels manifested at the same positions in the associated payload sequence. The indels were corrected to produce a polished payload sequence. Specifically, the insertions were removed, and the deletions were marked as erasures.
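The indel-correction logic of Step 2 can be sketched with a CIGAR-like alignment path; the op encoding below is our illustrative convention (BWA reports an equivalent path), and polish_payload is a hypothetical helper name.

```python
def polish_payload(read_bits, path, erasure=None):
    """Correct indels in a payload using the alignment path of its LCRC layer.

    path: list of (op, length) with 'M' = match/mismatch, 'I' = insertion in
    the read (extra bits, removed), and 'D' = deletion (missing bits, marked
    as erasures so the ECC decoder can treat them as known-position losses).
    """
    polished, i = [], 0
    for op, n in path:
        if op == 'M':
            polished.extend(read_bits[i:i + n]); i += n
        elif op == 'I':
            i += n                          # drop inserted bits
        elif op == 'D':
            polished.extend([erasure] * n)  # restore length; value unknown
    return polished
```

Because index and payload share positions (the accompanying property), the path inferred on the LCRC layer can be applied verbatim to the payload layer, as in this sketch.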

Consensus and error correction

Bit-wise consensus and error correction were performed to restore the original data after read identification.

Step 1: Bit-wise majority voting. On the basis of the identified indices, the multiple copies of the data segments were merged bit by bit to reconstruct the encoded data sequences. If the voting process failed to reach a decision, then the corresponding bit was marked as an erasure.

Step 2: Error and erasure correction. The embedded redundancy was used to correct residual errors and erasures in the consensus sequence, thereby enabling the error-free recovery of the original user data.
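Step 1 can be sketched as follows; bitwise_consensus is an illustrative helper that places each identified copy at its starting site and votes per position, emitting an erasure on a tie or an uncovered position, as described above.

```python
def bitwise_consensus(copies, length, erasure=None):
    """Merge index-aligned copies of the encoded data by bit-wise majority voting.

    copies: list of (start, bits) pairs placed at their identified starting
    sites; bits may contain None for positions already marked as erasures.
    """
    votes = [[0, 0] for _ in range(length)]      # [zeros, ones] per position
    for start, bits in copies:
        for j, b in enumerate(bits):
            if b is not None and start + j < length:
                votes[start + j][b] += 1
    out = []
    for zeros, ones in votes:
        if ones > zeros:
            out.append(1)
        elif zeros > ones:
            out.append(0)
        else:
            out.append(erasure)                  # tie or no coverage
    return out
```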

Data-carrying read identification in shotgun-sequenced readout of large DNA

Following demapping, SCC correlation, and start-site determination via the CRT, the index double-check was applied to filter out interference reads originating from the host genome and plasmid vector. Specifically, reads with correlation values above the preset correlation threshold were first classified as data-carrying sequences; otherwise, they were treated as interference. LCRC alignment can further rescue misidentified reads from the interference set. Then, the identified data-carrying payloads were subjected to bit-wise consensus. The numerical results indicate that the normalized correlation values between the LCRC and the host genome or plasmid vector are below 0.5, suggesting that a higher correlation threshold can achieve accurate identification.

Two-stage setting in the progressive error correction

Read identification in the progressive error correction operates in two stages: SCC correlation and LCRC alignment. The choice of how these stages are deployed depends on the underlying error profile. Guided by the empirically observed read usage ratios for each stage (table S13), the workflow can be flexibly adapted to different error conditions.

1) For low-error scenarios (<1%, e.g., NGS), SCC correlation alone provides rapid and accurate read identification and is therefore the optimal setting.

2) For moderate-error scenarios (~1 to 3%, e.g., nanopore sequencing with SUP/HAC base calling), two stages are both applied: SCC correlation efficiently filters low-error-rate reads, while LCRC alignment is applied to rescue reads with severe indels.

3) For severe-error scenarios (>7%, e.g., nanopore sequencing with FAST base calling), LCRC alignment alone is preferred.
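The decision rule above can be captured in a small dispatch sketch. The function name select_stages is ours, the thresholds are the empirical ones quoted, and the handling of the unstated 3 to 7% regime (both stages) is an assumption.

```python
def select_stages(error_rate):
    """Pick read-identification stages from the observed per-base error rate."""
    if error_rate < 0.01:   # low error, e.g., NGS
        return ('scc_correlation',)
    if error_rate > 0.07:   # severe error, e.g., nanopore FAST base calling
        return ('lcrc_alignment',)
    # moderate error (and, as an assumption, the unstated 3 to 7% gap):
    # correlation filters low-error reads, alignment rescues the rest
    return ('scc_correlation', 'lcrc_alignment')
```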

DNA data storage channel simulation

A realistic DNA data storage channel model that incorporates sequence-dependent errors and biases was established to simulate sequencing reads (fig. S55). The simulation comprised three stages (see text S4 for details):

Step 1: Synthesis simulation. Consecutive-deletion errors were introduced using an experimentally derived deletion-run distribution (17). Coverage bias was modeled using a normal distribution (55). A dilution step was then applied using binomial sampling, producing an average of ~10 copies per sequence.

Step 2: PCR amplification simulation. Substitution errors were injected on the basis of empirically reported error rates (17). PCR-induced coverage bias was modeled using a negative binomial distribution (55), and a GC-dependent scaling factor was applied (56).

Step 3: Sequencing simulation. Insertions (Pi), deletions (Pd), and substitutions (Ps) were introduced following a typical NGS error profile with Pi: Pd: Ps = 1:4:10 (6). Elevated indel rates in long homopolymer regions were modeled using a homopolymer length–dependent scaling factor (56). Reads were sampled at a target coverage using a binomial process (57).
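A minimal per-base error channel with the quoted split Pi : Pd : Ps = 1:4:10 might look like the sketch below (noisy_read is our illustrative name); the full model's deletion-run, coverage-bias, GC, and homopolymer effects are omitted.

```python
import random

def noisy_read(seq, total_rate, ratio=(1, 4, 10), seed=0):
    """Inject insertions, deletions, and substitutions whose rates split
    total_rate in the proportion Pi : Pd : Ps."""
    rng = random.Random(seed)
    p_ins, p_del, p_sub = (total_rate * r / sum(ratio) for r in ratio)
    out = []
    for base in seq:
        if rng.random() < p_ins:
            out.append(rng.choice('ACGT'))   # random inserted base
        if rng.random() < p_del:
            continue                         # base deleted
        if rng.random() < p_sub:
            out.append(rng.choice([b for b in 'ACGT' if b != base]))
        else:
            out.append(base)
    return ''.join(out)
```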

In silico simulation

The error tolerance and efficiency of both correlation-based and alignment-based readout methods were comprehensively evaluated using simulated sequencing reads. All simulations were performed on servers equipped with Intel Xeon Platinum 8268 CPUs @2.90 GHz, comprising 192 cores and 6 TB of memory.

Step 1: Data encoding. The LCRCs were constructed using five well-chosen SCCs with average lengths of 81 and 543. Traditional ECCs, such as LDPC codes and product codes with various code rates, were used to encode the user data. The product codes used LDPC codes for row coding and RS codes for column coding. Subsequently, the LCRC and the encoded data were partitioned into equal-length segments and combined to form final DNA payload sequences.

Step 2: Simulation of reads. Each payload sequence was duplicated a random number of times following a Poisson distribution. Random base insertion, deletion, and substitution errors were introduced to generate simulated sequencing reads. A wide range of error rates was tested, with errors uniformly distributed along the sequences for simplicity.

Step 3: Decoding and recovery. Correlation-based and alignment-based readout methods were respectively employed to restore the original data from the simulated reads.

Large-scale simulation

Extensive simulations were performed to assess the scalability of the proposed scheme for large-scale data storage (figs. S13 and S14). A real-world dataset from the image dataset DOTA (a large-scale dataset for object detection in aerial images) was adopted as the original data (43). The data were encoded using LDPC-RS product codes with a code rate of 0.82. The simulated reads were generated using the realistic DNA data storage channel model (fig. S55), with a typical NGS error rate of 0.005 and Pi: Pd: Ps = 1:4:10. All simulations were performed on servers equipped with Intel Xeon Platinum 8268 CPUs @2.90 GHz, comprising 192 cores and 6 TB of memory.

For data storage with oligo pools, a variety of LCRCs constructed from three or five SCCs were tested. Specifically, for n = 5, SCCs with average lengths of 138, 543, 1041, 2076, and 4114 were used for GB-scale demonstrations. For n = 3, SCCs with average lengths of 527, 1031, 2064, and 4102 were used for MB-scale demonstrations. In both cases, the LCRCs and the encoded data were partitioned into segments of the same length (200 or 255 bits) and then combined bit by bit to form DNA payload sequences (200 or 255 nt).

For data storage with large DNA fragments, an LCRC was constructed from five SCCs with an average length of 81. A total of 97,000 large DNA fragments (~33 kb each) were generated and processed with the read simulator ART to produce 150-nt shotgun sequencing reads.

Real-time readout using nanopore sequencing

The real-time readout pipeline for the large-scale oligo pools using the nanopore sequencer is as follows (text S6).

Step 1: Sequencer initialization. The prepared DNA library of LCS-Pool-960K was loaded into two PromethION flow cells (R10.4.1), and the sequencer was started. Before DNA sequencing began, the sequencer was initialized, including chip heating and pore scanning.

Step 2: Simultaneous sequencing, base calling, and read-by-read decoding. As soon as each sequencing read became available, the decoding procedure was executed. Alignment-based identification was performed read by read, followed by bit-wise consensus based on the identified indices to reconstruct the encoded data sequence. When the ratio of available bases within the reconstructed sequence reached the preset threshold, error correction was attempted. This process was repeated until the coverage was sufficient to restore the original data error free (movies S2 to S4). The high-accuracy model of the base caller Dorado (v7.6.8) was used.
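The streaming consensus logic can be sketched as per-position voting with an availability threshold. The function names, the `"N"` placeholder, and the 0.9 threshold are illustrative choices, not the paper's implementation.

```python
from collections import Counter, defaultdict

# votes[index][position] is a Counter of base calls seen at that position.
votes = defaultdict(lambda: defaultdict(Counter))

def update_consensus(index, bases):
    """Fold one identified read into the running consensus (read by read)."""
    for pos, b in enumerate(bases):
        votes[index][pos][b] += 1

def try_reconstruct(index, length, threshold=0.9):
    """Emit the majority-vote sequence once the fraction of positions with
    at least one vote reaches `threshold`; return None otherwise."""
    covered = sum(1 for pos in range(length) if votes[index][pos])
    if covered / length < threshold:
        return None
    return "".join(votes[index][pos].most_common(1)[0][0]
                   if votes[index][pos] else "N"
                   for pos in range(length))
```

Positions still below the threshold are left to the downstream error-correction attempt, which is what allows recovery to proceed before full coverage accumulates.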

Real-time readout of LCS-Pool-300K was also performed using two PromethION flow cells (R10.4.1) (text S7 and movies S5 to S7). The small-scale DNA library of HFS-Pool-11.7K was real-time sequenced using a MinION flow cell (R10.4.1) (text S8 and movie S8).

Acknowledgments

We would like to thank C. Han for constructive discussions regarding the design and improvement of this work.

Funding:

This work was supported by the National Key R&D Program of China under grants 2023YFA0913800 and 2024YFF1500500 to W.C.

Author contributions:

W.C. conceived the study and designed the indexing and readout correction scheme. Y.Z., Q.Ge., and Q.Gu. developed the encoding and decoding software. Y.Z. performed the simulation verification. R.Q. performed the in vitro and in vivo experiments. Y.Z., R.Q., Q.Ge., and Q.Gu. processed the data. W.C., Y.Z., and R.Q. prepared the figures and wrote the manuscript. All authors reviewed and edited the manuscript.

Competing interests:

W.C., Y.Z., and R.Q. have been granted a Chinese patent (CN119476422B) and report a US patent application (US 19/380,016) that relates to a DNA indexing scheme using composite ranging codes. All other authors declare that they have no competing interests.

Data, code, and materials availability:

All data and code needed to evaluate and reproduce the results in the paper are present in the paper and/or the Supplementary Materials. The raw sequencing data supporting the findings of this study have been deposited in the NCBI Sequence Read Archive (SRA) under accession no. PRJNA1371011. The source code for data retrieval using the proposed scheme is archived in Zenodo and is available for download via https://doi.org/10.5281/zenodo.17911697. A long-term maintained version of the source code is available at https://github.com/dna-storage-lab/DNAStorage_LCRC. This study did not generate any new materials.

Supplementary Materials

The PDF file includes:

Supplementary Text S1 to S9

Figs. S1 to S83

Tables S1 to S20

Legends for movies S1 to S8

References

sciadv.aec1469_sm.pdf (15.5MB, pdf)

Other Supplementary Material for this manuscript includes the following:

Movies S1 to S8

REFERENCES

1. Church G. M., Gao Y., Kosuri S., Next-generation digital information storage in DNA. Science 337, 1628 (2012).
2. Goldman N., Bertone P., Chen S., Dessimoz C., LeProust E. M., Sipos B., Birney E., Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
3. Ceze L., Nivala J., Strauss K., Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
4. Grass R. N., Heckel R., Puddu M., Paunescu D., Stark W. J., Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. Engl. 54, 2552–2555 (2015).
5. Erlich Y., Zielinski D., DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
6. Organick L., Ang S. D., Chen Y.-J., Lopez R., Yekhanin S., Makarychev K., Racz M. Z., Kamath G., Gopalan P., Nguyen B., Takahashi C. N., Newman S., Parker H.-Y., Rashtchian C., Stewart K., Gupta G., Carlson R., Mulligan J., Carmean D., Seelig G., Ceze L., Strauss K., Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
7. Tabatabaei S. K., Wang B., Athreya N. B. M., Enghiad B., Hernandez A. G., Fields C. J., Leburton J.-P., Soloveichik D., Zhao H., Milenkovic O., DNA punch cards for storing data on native DNA sequences via enzymatic nicking. Nat. Commun. 11, 1742 (2020).
8. Zhang C., Wu R., Sun F., Lin Y., Liang Y., Teng J., Liu N., Ouyang Q., Qian L., Yan H., Parallel molecular data storage by printing epigenetic bits on DNA. Nature 634, 824–832 (2024).
9. Ge Q., Qin R., Liu S., Guo Q., Han C., Chen W., Pragmatic soft-decision data readout of encoded large DNA. Brief. Bioinform. 26, bbaf102 (2025).
10. Chen W., Han M., Zhou J., Ge Q., Wang P., Zhang X., Zhu S., Song L., Yuan Y., An artificial chromosome for data storage. Natl. Sci. Rev. 8, nwab028 (2021).
11. Ping Z., Chen S., Zhou G., Huang X., Zhu S. J., Zhang H., Lee H. H., Lan Z., Cui J., Chen T., Zhang W., Yang H., Xu X., Church G. M., Shen Y., Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).
12. Shomorony I., Vahid A., Torn-paper coding. IEEE Trans. Inf. Theory 67, 7904–7913 (2021).
13. Bar-Lev D., Marcovich S., Yaakobi E., Yehezkeally Y., Adversarial torn-paper codes. IEEE Trans. Inf. Theory 69, 6414–6427 (2023).
14. Bar-Lev D., Sabary O., Yaakobi E., The zettabyte era is in our DNA. Nat. Comput. Sci. 4, 813–817 (2024).
15. Tomek K. J., Volkel K., Simpson A., Hass A. G., Indermaur E. W., Tuck J. M., Keung A. J., Driving the scalability of DNA-based information storage systems. ACS Synth. Biol. 8, 1241–1248 (2019).
16. Heckel R., Mikutis G., Grass R. N., A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).
17. Gimpel A. L., Stark W. J., Heckel R., Grass R. N., A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat. Commun. 14, 6026 (2023).
18. Doricchi A., Platnich C. M., Gimpel A., Horn F., Earle M., Lanzavecchia G., Cortajarena A. L., Liz-Marzán L. M., Liu N., Heckel R., Grass R. N., Krahne R., Keyser U. F., Garoli D., Emerging approaches to DNA data storage: Challenges and prospects. ACS Nano 16, 17552–17571 (2022).
19. Hawkins J. A., Jones S. K., Finkelstein I. J., Press W. H., Indel-correcting DNA barcodes for high-throughput sequencing. Proc. Natl. Acad. Sci. U.S.A. 115, E6217–E6226 (2018).
20. Matange K., Tuck J. M., Keung A. J., DNA stability: A central design consideration for DNA data storage systems. Nat. Commun. 12, 1358 (2021).
21. Meiser L. C., Gimpel A. L., Deshpande T., Libort G., Chen W. D., Heckel R., Nguyen B. H., Strauss K., Stark W. J., Grass R. N., Information decay and enzymatic information recovery for DNA data storage. Commun. Biol. 5, 1117 (2022).
22. Gimpel A. L., Stark W. J., Heckel R., Grass R. N., Challenges for error-correction coding in DNA data storage: Photolithographic synthesis and DNA decay. Digit. Discov. 3, 2497–2508 (2024).
23. Zhirnov V., Zadegan R. M., Sandhu G. S., Church G. M., Hughes W. L., Nucleic acid memory. Nat. Mater. 15, 366–370 (2016).
24. Motahari A. S., Bresler G., Tse D. N. C., Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory 59, 6273–6289 (2013).
25. Press W. H., Hawkins J. A., Jones S. K., Schaub J. M., Finkelstein I. J., HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proc. Natl. Acad. Sci. U.S.A. 117, 18489–18496 (2020).
26. Blawat M., Gaedke K., Huetter I., Chen X., Turczyk B., Inverso S., Pruitt B. W., Church G. M., Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016).
27. Antkowiak P. L., Lietard J., Darestani M. Z., Somoza M. M., Stark W. J., Heckel R., Grass R. N., Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction. Nat. Commun. 11, 5345 (2020).
28. Qu G., Yan Z., Wu H., Clover: Tree structure-based efficient DNA clustering for DNA-based data storage. Brief. Bioinform. 23, bbac336 (2022).
29. Bar-Lev D., Orr I., Sabary O., Etzion T., Yaakobi E., Scalable and robust DNA-based storage via coding theory and deep learning. Nat. Mach. Intell. 7, 639–649 (2025).
30. Wright E., Accurately clustering biological sequences in linear time by relatedness sorting. Nat. Commun. 15, 3047 (2024).
31. Song L., Geng F., Gong Z.-Y., Chen X., Tang J., Gong C., Zhou L., Xia R., Han M.-Z., Xu J.-Y., Li B.-Z., Yuan Y.-J., Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat. Commun. 13, 5361 (2022).
32. Consultative Committee for Space Data Systems (CCSDS), “Pseudo-noise (PN) ranging systems” (Publication 414.1-B-3, CCSDS, 2022); public.ccsds.org/Pubs/414x1b3e1.pdf.
33. Modenini A., Ripani B., A tutorial on the tracking, telemetry, and command (TT&C) for space missions. IEEE Commun. Surv. Tutor. 25, 1510–1542 (2023).
34. Boscagli G., Holsters P., Vassallo E., Visintin M., PN regenerative ranging and its compatibility with telecommand and telemetry signals. Proc. IEEE 95, 2224–2234 (2007).
35. Cappuccio P., Notaro V., Di Ruscio A., Iess L., Genova A., Durante D., Di Stefano I., Asmar S. W., Ciarcia S., Simone L., Report on first inflight data of BepiColombo’s Mercury orbiter radio science experiment. IEEE Trans. Aerosp. Electron. Syst. 56, 4984–4988 (2020).
36. Chen W., He Y., Han C., Yang J., Xu Z., Bit-level composite signal design for simultaneous ranging and communication. China Commun. 18, 214–227 (2021).
37. Titsworth R. C., Optimal ranging codes. IEEE Trans. Space Electron. Telem. 10, 19–30 (1964).
38. Weindel F., Gimpel A. L., Grass R. N., Heckel R., Embracing errors can be more efficient than avoiding them through constrained coding for DNA data storage. IEEE Trans. Mol. Biol. Multi Scale Commun. 12, 146–156 (2026).
39. Doroschak K., Zhang K., Queen M., Mandyam A., Strauss K., Ceze L., Nivala J., Rapid and robust assembly and decoding of molecular tags with DNA-based nanopore signatures. Nat. Commun. 11, 5454 (2020).
40. Chen W., Wang L., Han M., Han C., Li B., Sequencing barcode construction and identification methods based on block error-correction codes. Sci. China Life Sci. 63, 1580–1592 (2020).
41. Li H., Durbin R., Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
42. Cideciyan R. D., Furrer S., Lantz M. A., Product codes for data storage on magnetic tape. IEEE Trans. Magn. 53, 1–10 (2017).
43. G. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, DOTA: A large-scale dataset for object detection in aerial images, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 3974–3983.
44. Milstein L., Some statistical properties of combination sequences. IEEE Trans. Inf. Theory 23, 254–258 (1977).
45. Katoh K., Misawa K., Kuma K., Miyata T., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
46. Zhao X., Li J., Fan Q., Dai J., Long Y., Liu R., Zhai J., Pan Q., Li Y., Composite Hedges Nanopores codec system for rapid and portable DNA data readout with high INDEL-correction. Nat. Commun. 15, 9395 (2024).
47. S. Chandak, J. Neu, K. Tatwawadi, J. Mardia, B. Lau, M. Kubit, R. Hulett, P. Griffin, M. Wootters, T. Weissman, H. Ji, Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2020), pp. 8822–8826.
48. Lopez R., Chen Y.-J., Dumas Ang S., Yekhanin S., Makarychev K., Racz M. Z., Seelig G., Strauss K., Ceze L., DNA assembly for nanopore data storage readout. Nat. Commun. 10, 2933 (2019).
49. W. Chen, C. Liang, T. Guo, Y. Ding, Encoder implementation with FPGA for nonbinary LDPC codes, in the 18th Asia-Pacific Conference on Communications (APCC) (IEEE, 2012), pp. 980–984.
50. Zerbino D. R., Birney E., Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
51. Huang W., Li L., Myers J. R., Marth G. T., ART: A next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
52. Kovaka S., Fan Y., Ni B., Timp W., Schatz M. C., Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat. Biotechnol. 39, 431–441 (2021).
53. Tabatabaei S. K., Pham B., Pan C., Liu J., Chandak S., Shorkey S. A., Hernandez A. G., Aksimentiev A., Chen M., Schroeder C. M., Milenkovic O., Expanding the molecular alphabet of DNA-based data storage systems with neural network nanopore readout processing. Nano Lett. 22, 1905–1914 (2022).
54. Šošić M., Šikić M., Edlib: A C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395 (2017).
55. Chen Y.-J., Takahashi C. N., Organick L., Bee C., Ang S. D., Weiss P., Peck B., Seelig G., Ceze L., Strauss K., Quantifying molecular bias in DNA data storage. Nat. Commun. 11, 3264 (2020).
56. Ross M. G., Russ C., Costello M., Hollinger A., Lennon N. J., Hegarty R., Nusbaum C., Jaffe D. B., Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
57. Yuan L., Xie Z., Wang Y., Wang X., DeSP: A systematic DNA storage error simulation pipeline. BMC Bioinf. 23, 185 (2022).
58. Sarwate D. V., Pursley M. B., Crosscorrelation properties of pseudorandom and related sequences. Proc. IEEE 68, 593–619 (1980).
59. Boehmer A., Binary pulse compression codes. IEEE Trans. Inf. Theory 13, 156–167 (1967).
60. Schwarz M., Welzel M., Kabdullayeva T., Becker A., Freisleben B., Heider D., MESA: Automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing and PCR errors. Bioinformatics 36, 3322–3326 (2020).
61. Chaykin G., Sabary O., Furman N., Shabat D. B., Yaakobi E., DNA-Storalator: A computational simulator for DNA data storage. BMC Bioinf. 26, 204 (2025).
62. Hughes R. A., Ellington A. D., Synthetic DNA synthesis and assembly: Putting the synthetic in synthetic biology. Cold Spring Harb. Perspect. Biol. 9, a023812 (2017).
63. Baichoo S., Ouzounis C. A., Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 156-157, 72–85 (2017).
64. Lam T. W., Sung W. K., Tam S. L., Wong C. K., Yiu S. M., Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008).
65. Zhong C., Zhang S., Accurate and efficient mapping of the cross-linked microRNA-mRNA duplex reads. iScience 18, 11–19 (2019).
66. Zadeh J. N., Steenberg C. D., Bois J. S., Wolfe B. R., Pierce M. B., Khan A. R., Dirks R. M., Pierce N. A., NUPACK: Analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173 (2011).
67. Zhang J., Kobert K., Flouri T., Stamatakis A., PEAR: A fast and accurate Illumina paired-end read merger. Bioinformatics 30, 614–620 (2014).
68. Yazdi S. M. H. T., Gabrys R., Milenkovic O., Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
