Abstract
Data storage in DNA has recently emerged as a promising archival solution, offering ultra-high density and long-term durability with low maintenance requirements. Among the various approaches, combinatorial DNA encoding further increases the logical density by representing each sequence position with a combination of predefined short DNA fragments (shortmers), allowing a larger data volume to be encoded using fewer synthesis cycles. However, this method introduces unique challenges, particularly in terms of synthesis and sequencing errors. In this study, we focus on the characterization of errors in combinatorial DNA-based storage systems. Our analysis revealed that asymmetric combinatorial erasure errors, defined as the omission of a single shortmer from the set defining a combinatorial letter, are a prevalent error type in combinatorial DNA-based storage, particularly in large-scale systems where read coverage is limited. We analyzed two previously published datasets and observed a high frequency of erasure errors, in which missing sequences obstruct the reconstruction of specific combinatorial letters. To better understand these observations, we conducted a large-scale experimental proof-of-concept of a combinatorial DNA-based storage system and evaluated its error characteristics. Our analysis confirmed that erasure errors become increasingly prominent as sequencing depth is reduced: below 50 reads per sequence, the frequency of erasure errors sharply increased. We then developed an asymmetric error-correcting code specifically designed to address these errors. The code utilizes tensor-product (TP) codes to integrate standard erasure- and substitution-correcting codes (such as Reed-Solomon (RS) codes) with Varshamov-Tenengolts (VT) codes, which are asymmetric error-correcting codes. We validated the performance of our new error correction code both in simulations and in a second large-scale experiment, designed to directly compare our suggested code with the more straightforward 2D Reed-Solomon (2D RS) scheme. Our method consistently outperformed the 2D RS scheme, particularly in scenarios dominated by erasure errors. Notably, in the second large-scale experiment, our method demonstrated superior decoding accuracy even under low-coverage conditions, where the traditional 2D RS approach struggled to decode the data. Our findings demonstrate the importance of tailored error correction schemes in DNA-based data storage. By directly addressing the asymmetric nature of errors in combinatorial DNA, our method provides improved decoding accuracy under a wide range of conditions. The integration of such tailored error correction methods with existing DNA-based data storage technologies has the potential to deliver more reliable and scalable DNA-based data applications.
Subject terms: Engineering, Mathematics and computing
Introduction
The global data landscape is expanding rapidly, with estimates placing it at 180 zettabytes by the end of 2025 and projections suggesting it will exceed 400 zettabytes by 20281. Current storage devices, such as magnetic tapes, hard disks, and solid-state drives, are struggling to meet this demand2,3, mostly because of limitations in physical space and energy consumption. As an alternative, DNA-based data storage systems have been suggested, offering high density, long-span durability, and low maintenance energy consumption. Since 2012, various studies have presented successful DNA-based storage solutions, both in academia and industry4–13. In addition to these works, several recent studies have proposed new encoding approaches for DNA-based data storage. Wu et al.14 introduced the Repeating Substring Tree Encoding (RSTE) method, which identifies repeated substrings in the source data and encodes them as DNA motifs to improve the encoding rate while ensuring that the resulting sequences satisfy standard biochemical constraints such as run-length limits, GC balance, and end constraints. Other examples include a parity encoding and local mean iteration (PELMI) scheme for robust DNA image storage15 and an invertible neural network-based DNA image storage method (INNSE) that applies up-down sampling to reduce the number of required DNA sequences and uses an internal base-level self-correction mechanism without additional redundancy16.
In general, the in-vitro DNA-based storage pipeline works as follows. First, the user’s information is encoded into DNA sequences (sequences over the alphabet of A, C, G, T) based on a predefined coding scheme. These sequences are termed encoded sequences. Next, molecules representing the encoded sequences are generated using a biochemical process termed DNA synthesis, which produces single-stranded DNA oligonucleotides (or in short, oligos). These synthesized oligos, which now store the user’s information, are placed together, typically unordered, in a small storage container, usually a vial. To retrieve the data, a small sample of the oligos is taken, and using DNA sequencing, the nucleotide base sequences are read back and decoded. Since both synthesis and sequencing introduce errors, it is necessary to encode the information with an error-correcting code (ECC)7–10,13,17. Thus, by including redundancy within the encoded sequences, it is possible to create a system that maintains data reliability even in the presence of errors.
Biological and biochemical constraints are fundamental to the design of reliable DNA-based data storage systems. Synthesized oligos must satisfy requirements such as maintaining appropriate GC content, avoiding long homopolymer runs, minimizing secondary structure formation, and preventing unintended cross-hybridization or self-priming during PCR and sequencing. These considerations have motivated a range of constrained coding techniques, including mutually uncorrelated (MU) codes and k-weakly mutually uncorrelated (k-WMU) codes, which restrict long prefix-suffix overlaps and are commonly used to design unique sequences that improve access efficiency in DNA-based storage systems. Recent work by Liu et al.16 presented systematic constructions of MU and k-WMU code families that explicitly account for multiple biological and biochemical constraints during sequence design. While our work focuses on combinatorial DNA encoding, these biological and biochemical constraints must also be taken into consideration in the design and analysis of combinatorial DNA-based storage systems.
A major bottleneck that still limits the adoption of DNA-based data storage solutions is the cost and latency of DNA synthesis18. On the other hand, most synthesis methods have inherent redundancy: they generate multiple copies per oligo, usually on the order of thousands to hundreds of thousands19. To leverage this redundancy and alleviate the bottleneck, the idea of composite DNA was suggested in 2019 by Anavy et al.10 and by Choi et al.20. These methods use the inherent redundancy of the synthesis to introduce additional alphabet letters beyond the standard four (A, C, G, T). Therefore, the potential information density increases, meaning that more information can be encoded using the same number of synthesis cycles. Informally speaking, a composite DNA letter is a mixture consisting of more than one nucleotide in a predetermined ratio. This predetermined mixture is realized by randomly synthesizing different nucleotides on different molecules at the same position. Thus, to identify a composite letter, it is required to sequence a fraction of these multiple copies and to estimate the original composition of the different nucleotides. Although this method increases capacity, encoding using composite DNA introduces new error types and thus requires dedicated error correction codes. Error-correcting codes for composite DNA were suggested by Walter et al.21, while results related to the information capacity of this channel were given by Cohen et al.22 and Kobovich et al.23,24.
Another method to increase logical density and reduce overall costs in DNA-based storage systems is to use combinatorial DNA synthesis, which has recently been suggested25–27. Molecular mechanisms for generating combinatorial sets of DNA sequences include combinatorial DNA synthesis and combinatorial DNA assembly. Nonetheless, they all share a common encoding approach of using combinations of DNA sequences as the data-holding unit in the system. This common approach was previously referred to as combinatorial DNA encoding and formally defined as follows25. Let Σ be a set of N DNA k-mers that are used as building blocks and selected to minimize mix-up errors (e.g., by ensuring a minimal distance d between each pair). A combinatorial alphabet Φ is defined such that each letter φ ∈ Φ represents a subset of K k-mers from Σ, forming an alphabet of size |Φ| = C(N, K). A binary message of length B bits is encoded by generating a sequence of M combinatorial letters φ1φ2…φM, where M = ⌈B / ⌊log2 |Φ|⌋⌉. Fig. 1 illustrates this combinatorial encoding approach, presenting an example combinatorial alphabet with N = 8 and K = 4 and the encoding of a 42-bit message using a 7-letter combinatorial sequence. After encoding the input binary message into a set of combinatorial DNA sequences, various molecular techniques can be used to generate a pool of DNA molecules. Each actual oligo or molecule holds a single combination of the DNA k-mers and, collectively, the pool represents the original set of combinatorial sequences. After sequencing a sample of the DNA molecules, reconstruction of the combinatorial sequences involves three key steps. First, the reads are split into groups, where each group represents a single combinatorial sequence. This step is typically done using a barcode sequence located at a predefined position of each DNA molecule. Next, each combinatorial sequence is reconstructed independently from the grouped reads, one position at a time, by detecting K unique k-mers in the relevant position and identifying the corresponding combinatorial letter. Lastly, the original message is decoded by applying error-correcting codes (refer to Fig. 2 in25 for details on the combinatorial encoding pipeline).
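As a concrete illustration of this definition, the following minimal Python sketch implements the bit-to-letter mapping for the example parameters of Fig. 1 (N = 8, K = 4); the lexicographic ordering of the K-subsets is an arbitrary illustrative choice, not necessarily the mapping used in the cited works.

```python
from itertools import combinations
from math import comb, floor, log2

# Illustrative parameters from Fig. 1: N = 8 building blocks, K = 4 per letter.
N, K = 8, 4
BITS_PER_LETTER = floor(log2(comb(N, K)))  # floor(log2(70)) = 6 bits per letter

# Fix an arbitrary (lexicographic) ordering of the K-subsets;
# the first 2^6 = 64 subsets serve as the combinatorial letters.
ALPHABET = list(combinations(range(N), K))[: 2 ** BITS_PER_LETTER]

def encode_bits(bits: str) -> list[tuple[int, ...]]:
    """Map a binary string to a sequence of combinatorial letters (K-subsets)."""
    assert len(bits) % BITS_PER_LETTER == 0
    chunks = [bits[i:i + BITS_PER_LETTER] for i in range(0, len(bits), BITS_PER_LETTER)]
    return [ALPHABET[int(chunk, 2)] for chunk in chunks]

# A 42-bit message maps to M = 7 combinatorial letters, as in Fig. 1d.
letters = encode_bits("101101000111010010110010101001110001010110")
print(len(letters), letters[0])  # 7 letters; each is a set of 4 shortmer indices
```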
Fig. 1.
Schematic overview of a combinatorial alphabet and encoding. (a) The plot describes Σ, the set of N = 8 building block shortmers of length k. (b) The building block shortmers can create a set of C(8,4) = 70 combinatorial letters, each defined by a selection of K = 4 shortmers. (c) Since 2^6 = 64 ≤ 70, all 2^6 = 64 possible vectors of 6 bits are mapped into 64 combinatorial letters. (d) An example of a combinatorial sequence of length M = 7, represented as a matrix. Each column corresponds to a combinatorial letter, and each row represents a shortmer building block. The cells corresponding to the selected shortmers in each column are highlighted in light blue.
Fig. 2.
Analysis of experimental data from Yan et al.27. (a) Sequence design: each construct consists of a unique barcode (BCi), a universal sequence (U1), and a payload segment (P1). The universal sequence is fixed across all constructs in the same cycle and facilitates the assembly process. (b) Read length distribution; the dotted red line represents the filtering threshold used for analysis. (c) Distribution of the start position of U1 in the analyzed sequences. The dotted red line represents the designed start position. (d) Evidence for erasure errors. The x-axis is the average number of reads per sequence, and the y-axis is the number of unique shortmers observed out of 32, with N = 96, K = 32, M = 1, and 8 different sequences; error bars are calculated over 10 repeats of randomly down-sampling the actual data.
Standard DNA-based storage systems use error correction to address common synthesis and sequencing errors, such as single-base substitutions, insertions, and deletions. These symbol-level errors are addressed in most studies by applying error correction codes and constrained coding. Another frequent issue in DNA-based data storage systems is sequence dropout, where certain sequences are altogether missing from the NGS output, often due to biases in molecular biology processes and the effect of sampling in the readout process. To address this, sequence-level error correction codes are used, allowing the recovery of complete missing or corrupt sequences of the message. Combining symbol-level (inner) and sequence-level (outer) error correction codes forms a 2-dimensional (2D) error correction scheme. A common method for addressing both symbol- and sequence-level errors is leveraging the inherent multiplicity in DNA synthesis and sequencing: increasing the sampling rate improves sequence coverage, reduces dropouts, and corrects symbol-level errors through consensus-based methods28,29. However, in large-scale systems, the ability to increase sequence coverage may be limited and also carries significant costs. Preuss et al.30 applied a variant of the coupon collector distribution to analyze and determine the required sequencing coverage in combinatorial DNA-based data storage systems. In contrast to standard DNA-based systems, the reconstruction of a combinatorial letter depends on observing at least one copy of each of the K k-mers in its composition, making sequence coverage a critical parameter. Even after observing K unique k-mers, we are not guaranteed to recover the correct combinatorial letter due to possible errors. A proposed solution is to observe each k-mer multiple times before inferring the combinatorial letter from the K most frequent k-mers. This required multiplicity (t) is adjustable, and its effect on the overall required sequencing depth has also been analyzed in the literature28.
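To make the coverage requirement concrete, the following sketch computes the classic coupon-collector quantities for a single combinatorial letter, under the simplifying assumption that each read of a position reveals one of its K shortmers uniformly at random; this illustrates the intuition only and is not the exact variant analyzed by Preuss et al.30.

```python
from math import comb

def expected_reads(K: int) -> float:
    """Coupon collector: expected number of reads of one position until
    all K shortmers of its letter are observed at least once."""
    return K * sum(1 / i for i in range(1, K + 1))

def p_all_seen(K: int, r: int) -> float:
    """P(all K shortmers observed within r reads), by inclusion-exclusion."""
    return sum((-1) ** j * comb(K, j) * ((K - j) / K) ** r for j in range(K + 1))

print(expected_reads(4))   # ~8.33 reads on average for K = 4
print(p_all_seen(4, 20))   # probability that 20 reads avoid an erasure
```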
Optimal error correction schemes rely on a proper characterization of the information channel and the expected error types. To properly characterize the combinatorial DNA channel, we analyzed two previously published datasets, focusing on the observed error patterns, particularly when down-sampling the data. Our analysis revealed that erasure errors, where combinatorial letters with missing k-mers are observed, were a dominant error type in both datasets, which represent different synthesis protocols and sequencing technologies. While these “asymmetric combinatorial erasures” can arise as a direct consequence of base-level insertions, deletions, or substitutions that render certain shortmers unidentifiable, our analysis indicates that they also reflect higher-level misrepresentation events specific to the combinatorial structure, such as the absence of a shortmer in a given cycle. This observation highlighted the critical impact of read depth on decoding accuracy in combinatorial DNA data storage. Given the nature of some combinatorial DNA experimental systems, where high variation in sequence length and structure is observed, we developed a BLAST-based approach for sequence analysis. This method enabled accurate identification of k-mers within each read, even in cases where synthesis and sequencing errors disrupted the expected sequence structure. To further investigate these observations, we developed, together with Helixworks Technologies Ltd, a large-scale experimental proof-of-concept of a combinatorial DNA-based data storage system. The experiment, comprising 640 combinatorial sequences synthesized using 8 cycles, represents the first implementation of a large-scale combinatorial DNA-based storage system. We used the data from this experiment to evaluate the error characteristics of combinatorial DNA. Our analysis of the sequencing data confirmed a high prevalence of erasure errors, particularly when the read depth was reduced. We demonstrated that when the number of sequencing reads per sequence is low, the frequency of erasure errors sharply increases, significantly impacting decoding accuracy. In response to these findings, we developed an asymmetric error-correcting code, specifically designed to address asymmetric errors in combinatorial DNA-based data storage31. This k-mer-level error correction is introduced on top of any base-level error correction in use, to specifically mitigate the remaining asymmetric combinatorial errors. We compared this error correction scheme to existing error correction approaches in combinatorial DNA-based storage using simulations. Since the combinatorial DNA channel differs fundamentally from the nucleotide-level channels considered in most prior DNA storage works, we restrict quantitative comparisons to coding schemes operating under comparable combinatorial alphabets and error models. Under these conditions, our method consistently outperformed 2-dimensional Reed-Solomon approaches (2D RS), particularly in scenarios dominated by erasure errors, reflecting the real-world conditions observed in the analyzed datasets. Finally, we performed a second large-scale experiment: a direct experimental comparison in which 320 sequences encoded with our new error correction method were evaluated against 320 sequences encoded with the 2D RS approach. Our method demonstrated superior performance in recovering sequences with erasure errors, particularly under low-coverage conditions, where 2D RS struggled.
These results validate the effectiveness of our approach, establishing a clear advantage over traditional methods in real-world scenarios. Our study not only provides a detailed understanding of the error characteristics in combinatorial DNA-based data storage but also offers a practical, scalable error correction solution for improving data integrity in the context of this emerging technology.
Results
Error characterization of previously published datasets
To characterize the channel properties of combinatorial DNA-based storage systems, we analyzed two previously published datasets25,27. While both experiments demonstrated the encoding of data using combinatorial DNA, they differ in their sequencing technologies and synthesis protocols: Yan et al. used Bridge Oligo Assembly, while Preuss et al. used Gibson Assembly. Table 1 includes the design parameters for both experiments.
Table 1.
Design parameters for the datasets analyzed in this section.
| Parameter | Yan et al.27 | Preuss et al.25 |
|---|---|---|
| Message length (bits) | 84 | 96 |
| Sequence length/number of cycles (M) | 1 | 4 |
| Number of sequences | 8 (repeats) | 2 |
| Logical density (bits/synthesis-cycle) | 84 | 12 |
| Number of unique k-mers (N) | 96 | 16 |
| Number of unique k-mers in each letter (K) | 32 | 5 |
| Sequence length in bases | 74 bp | 220 bp |
| k-mer length | 25 bp | 20 bp |
| Synthesis protocol | Bridge Oligo Assembly | Gibson Assembly |
| Sequencing method | Nanopore | Illumina (MiSeq) |
We start by analyzing the data from Yan et al.27. The experiment included a single combinatorial synthesis cycle (M = 1) in which two DNA shortmers (i.e., a barcode and a single payload) are assembled using Bridge Oligonucleotide Assembly. The payload shortmer represents a single combinatorial letter with N = 96 and K = 32, encoding an 84-bit “HelloWorld” message. This experiment was repeated to generate 8 different sequences, where each sequence contains a unique barcode sequence (BC), a universal sequence (U1), and a combinatorial payload sequence (P1). The universal sequence (U1) is a short, fixed sequence that is identical across all sequences in this cycle and is used in the assembly process. Typically, each cycle has its own distinct universal sequence, which we denote by Ui, where i corresponds to the cycle index. The sequence design is depicted in Fig. 2a. The resulting molecules were sequenced using Oxford Nanopore MinION technology, generating 102,222 reads to be analyzed.
First, we examined the read length distribution of the analyzed data (see Fig. 2b) and observed an average read length of 174 nt with a standard deviation of 62 nt and a median length of 164 nt. We filtered the reads to retain only those longer than 50 nucleotides. Since many of the reads were much longer than the expected 74 nt, locating the universal sequence required a dedicated algorithm.
For that, we employed the Basic Local Alignment Search Tool (BLAST)32 to locate the position of the U1 sequence in each read. We then used this position to match the BC and payload sequences adjacent to it. Details on this process are further elaborated in Methods, BLAST-based analysis of the data from Yan et al.27.
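A minimal sketch of this localization step is shown below, assuming the standalone blastn executable is installed; the file names, short-task settings, and best-hit policy are illustrative rather than the exact parameters of our pipeline.

```python
import csv
import io
import subprocess

def locate_universal(reads_fasta: str, universal_fasta: str) -> dict[str, int]:
    """Run blastn in short-sequence mode and return, for each read, the start
    position of its best U1 hit (1-indexed, on the read)."""
    out = subprocess.run(
        ["blastn", "-query", universal_fasta, "-subject", reads_fasta,
         "-task", "blastn-short", "-outfmt", "6 sseqid sstart send evalue"],
        capture_output=True, text=True, check=True).stdout
    starts: dict[str, int] = {}
    for read_id, sstart, send, _evalue in csv.reader(io.StringIO(out), delimiter="\t"):
        # Keep only the first (best-ranked) hit reported for each read.
        starts.setdefault(read_id, min(int(sstart), int(send)))
    return starts
```

Once U1 is located, the barcode and payload windows can be taken at fixed offsets from the identified start position.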
Analysis of the start positions of U1, depicted in Fig. 2c, revealed that the U1 start positions varied significantly across the reads (average = 174, median = 164, sd = 62). This variability further justified the use of a dedicated BLAST-based algorithm to identify the universal sequence start position. The reads were then split according to the identified barcode, distinguishing the 8 repeats and resulting in an average of approximately 1,000 reads per sequence. After keeping only reads in which both the barcode sequence and a valid payload sequence were identified, an effective dataset of only approximately 150 reads per sequence remained. See sequence counts in Supplementary Table 1. As illustrated in Fig. 2d, these conditions led to a failure in recovering all 32 shortmers, highlighting significant deficiencies in read quantity and quality and underscoring the challenges in data recovery even with complex decoding algorithms.
Next, we analyzed data taken from the study by Preuss et al.25, which used a combinatorial Gibson assembly protocol and NGS sequencing on the Illumina MiSeq platform. This experiment included two combinatorial sequences encoding 96 bits, which code the message “DNA Storage!”. Each sequence comprised M = 4 cycles, using a combinatorial alphabet with K = 5 and N = 16 unique shortmers. This resulted in DNA molecules of length 220 nucleotides, each containing 6 universal sequences. The sequencing output included 2,634,683 total reads, of which 2,151,661 were of the expected length of 220 nucleotides (average = 217, median = 220, sd = 10, see Fig. 3a).
Fig. 3.
Experiment analysis of Preuss et al.25. (a) Read length distribution; the red dotted line represents the correct sequence length (220 nt). (b) Evidence of erasure errors from a down-sampling simulation of sequencing reads. The x-axis represents the average number of reads per sequence, and the y-axis shows the number of unique shortmers observed out of 5. The simulation was performed with parameters N = 16, K = 5, and M = 4, over 2 different sequences. Error bars indicate the standard deviation over 10 independent down-sampling repeats.
Our analysis focused only on the reads of length 220. For the first sequence, which was barcoded by BC1, there were 1,365,295 reads, and for the second sequence (barcoded by BC2) there were 768,755 reads, as detailed in Supplementary Table 1. The high quality of the sequencing results in this experiment made the retrieval of the barcode, the universal sequences, and the 5 payload sequences in each cycle relatively easy. As a result, the original message was successfully recovered even after down-sampling the sequencing output to an average of 100 reads per sequence (see Fig. 3b). Further reducing the sequencing depth resulted in a failure to recover the original message, caused by a failure to detect all 5 shortmers in each cycle.
The results from the analysis of the two datasets suggest that a special case of an erasure error is a prominent error type in combinatorial DNA-based data storage systems. This is especially true when using low read counts, as required by larger-scale systems. This underscores the need to develop error correction codes specifically designed to mitigate such erasure errors in combinatorial DNA data storage systems. For a detailed description of the error correction mechanism, see Section Error-correcting codes for combinatorial DNA, which presents a code construction based on the mathematical framework of Sabary et al.31.
Large-scale experimental proof of concept
Previous research in DNA-based data storage, particularly utilizing combinatorial DNA encodings, has highlighted the necessity of validating the technology and examining the information channels at larger scales. For example, the experimental work of Anavy et al.10 showed that large composite DNA alphabets cannot be used at scale due to the challenges of inferring exact base ratios. This was also demonstrated in theoretical work analyzing the error patterns and coverage depth issues in composite and combinatorial DNA-based data storage systems22,30,33–35. Thus, scaling up combinatorial DNA-based data storage systems requires a comprehensive analysis of the information channel to maintain reliability and efficiency. To this end, we performed a large-scale experimental proof-of-concept study of an end-to-end combinatorial DNA-based storage system, encoding and decoding a 2.67 KB input message using the combinatorial shortmer scheme. Since combinatorial DNA synthesis technology is not yet available, the combinatorial approach was implemented using the Bridge oligonucleotide assembly of Yan et al.27 as an ad-hoc imitation of combinatorial synthesis.
The experiment constructed 640 combinatorial sequences with characteristics similar to the experiment described in Yan et al.27 (see Table 2). Each combinatorial sequence contained a barcode (referred to as inner barcodes 1–64) and M = 8 payload cycles over a combinatorial alphabet with N = 8 and K = 4. The assembly was performed using DNA fragments composed of a 20-mer information sequence and an overlap of 25 bp between adjacent fragments, as shown in Fig. 4a. The 640 combinatorial DNA sequences were constructed in 10 separate pools of 64 sequences each. Each pool was then barcoded separately (referred to as outer barcodes 1–10) and prepared for sequencing using Oxford Nanopore Technologies (see Protocol for large scale proof of concept experiment). The 10 pools were then sequenced together, producing 3,855,233 reads (see Table 3). Only 2,136,534 reads of the output were longer than 500 bp (average = 516, median = 517, sd = 133, see Fig. 4b). The average number of reads per sequence over all 640 sequences is 405, with one pool (outer barcode 2) containing an average of only 45 reads per sequence (see Table 4).
Table 2.
Design parameters of the large-scale proof of concept experiment.
| Parameters | Value |
|---|---|
| Text size | 2.67 KB |
| Number of sequences | 640 |
| Sequence length/number of cycles (M) | 8 |
| Logical density (bits/ Synthesis Cycle) | 6 |
| Number of unique k-mers (N) | 8 |
| Number of unique k-mers in each letter (K) | 4 |
| Combinatorial alphabet size | C(8,4) = 70 |
| Error correction method | 2D RS |
| Inner error-correcting code | RS(8,6) |
| Outer error-correcting code | RS(32,30) |
Fig. 4.
Analysis of data from the large-scale experimental proof of concept. (a) Sequence design and potential errors. The top diagram presents a schematic description of the sequence design, including universals and payloads. The bottom two diagrams show two potential erroneous reads, in which the positions of the universal sequences and payloads vary. The red cross corresponds to missing or unidentifiable payloads. (b) Read length distribution. (c) Start positions of all universal sequences within the reads.
Table 3.
Summary of the counts of the sequencing reads analyzed in the large-scale proof of concept experiment.
| Condition | Value |
|---|---|
| Number of reads | 3,855,223 |
| Number of reads with 500bp+ | 2,136,524 |
| Number of reads with BC identified | 260,245 |
| Recovered sequences before error-correction (2D RS) | 631/640 |
| Recovered sequences after error-correction (2D RS) | 640/640 |
Table 4.
Summary of the average number of reads per outer BC. Each outer BC was processed in a separate well. Note that the second outer barcode had far fewer reads and is therefore highlighted in bold.
| Outer BC | Inner BC | Average number of reads |
|---|---|---|
| 1 | 1–64 | 395 |
| 2 | 65–128 | 45 |
| 3 | 129–192 | 454 |
| 4 | 193–256 | 646 |
| 5 | 257–320 | 418 |
| 6 | 321–384 | 468 |
| 7 | 385–448 | 406 |
| 8 | 449–512 | 412 |
| 9 | 513–576 | 416 |
| 10 | 577–640 | 405 |
In this experiment, we incorporated error-correcting codes to enable data recovery even in the presence of errors. Specifically, we utilized a 2D Reed-Solomon (RS) code, as described in Preuss et al.25. Briefly, an inner code (RS(8,6)) was applied to each 6-symbol sequence, adding two redundancy symbols and allowing the correction of a single error within that sequence. Additionally, for each block of 30 sequences, an outer code (RS(32,30)) was applied across sequences, adding two redundancy symbols to the length-30 column of symbols at each position (i.e., two redundancy sequences per block).
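The following sketch illustrates this 2D layout, assuming the third-party reedsolo Python package for the RS arithmetic and modeling symbols as bytes; the actual codes in the experiment operate over the 6-bit combinatorial letter alphabet rather than GF(256).

```python
from reedsolo import RSCodec

inner = RSCodec(2)  # RS(8,6): 6 data symbols + 2 parity symbols per sequence
outer = RSCodec(2)  # RS(32,30): 30 sequences + 2 redundancy sequences per block

def encode_2d(block: list[bytes]) -> list[bytes]:
    """block: 30 sequences of 6 symbols -> 32 sequences of 8 symbols."""
    rows = [bytes(inner.encode(seq)) for seq in block]           # inner code per row
    cols = [bytes(outer.encode(bytes(r[i] for r in rows)))       # outer code per
            for i in range(8)]                                   # symbol position
    return [bytes(col[j] for col in cols) for j in range(32)]

block = [bytes([j % 256] * 6) for j in range(30)]
encoded = encode_2d(block)
print(len(encoded), len(encoded[0]))  # 32 sequences of 8 symbols each
```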
In the original sequence design, every payload sequence position is adjacent to a specific universal sequence (Fig. 4a). However, our analysis revealed that the sequences contained gaps of various lengths with unidentifiable content. This can be seen in Fig. 4b,c, where the overall sequence length exceeds the expected length and the position at which each universal sequence is found within the read varies significantly. These observations suggested that the naive BLAST-based algorithm presented above required adjustments to support longer combinatorial sequences containing multiple cycles. We therefore adjusted the BLAST-based decoding algorithm to support the identification and analysis of multiple universal-payload sequence pairs in every analyzed read, as well as the analysis of reads containing only a subset of the universal-payload pairs. A full description of the adjusted algorithm is available in the Methods section, BLAST-based analysis of large-scale combinatorial experiment.
The analysis of 100% of the data revealed that 631 out of the 640 sequences were successfully recovered. For each of these 631 sequences, all four payload sequences across all eight cycles were fully reconstructed. Our 2D RS error-correcting approach effectively corrected the errors in the 9 unsuccessful sequences, allowing us to recover all 640 sequences and the original encoded message.
Although these 9 sequences were ultimately corrected by the 2D RS code, we conducted a deeper analysis of their error profiles to better understand the underlying error behavior. The primary issue appears to be the low read count in the sequence pool that was barcoded with outer barcode BC2; this limited read coverage significantly impacted the recovery of these 9 sequences. A characterization of the different error types is given in the histograms presented in Fig. 5. In the histogram presented in Fig. 5a, we can see that in the seventh cycle of inner barcode BC70 we observed more than four payloads, with a tie on the fourth most common payload sequence. We therefore had to select which payload sequence to retain. This approach occasionally resulted in selecting the incorrect set of four payloads, which ultimately led to the wrong combinatorial letter being chosen. In the histogram presented in Fig. 5b, we can see that in the seventh cycle of inner barcode BC75 only three payload sequences are observed, resulting in an erasure error. As seen in Fig. 5c, in the second cycle of inner barcode BC583 there were three payloads with counts exceeding 75, while the fourth payload had a significantly lower count. Notably, the count for payload 6 (red square) is higher than that of payload 4 (green square), even though payload 4 represents the correct payload. To avoid accidental mix-ups, we sometimes require observing at least t copies of each payload sequence. This may result in erasure errors in which only three payload sequences are observed with enough copies. Choosing the correct threshold t is an important design parameter for the decoding process, representing the trade-offs between different error types. Similar read-count histograms for all 9 unsuccessful sequences are available in Supplementary Experimental proof of concept 9BC that did not recover out of the 640BC.
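A minimal sketch of this thresholded reconstruction rule follows; the function name and the tie-handling policy (declaring an erasure on an unresolvable tie) are illustrative.

```python
from collections import Counter

def reconstruct_letter(observed_kmers: list[int], K: int = 4, t: int = 1):
    """Infer a combinatorial letter from the k-mers observed at one cycle.
    Returns the K most frequent k-mers, or None (erasure) when fewer than K
    k-mers reach the multiplicity threshold t, or the K-th place is tied."""
    ranked = [(kmer, c) for kmer, c in Counter(observed_kmers).most_common()
              if c >= t]
    if len(ranked) < K:
        return None  # asymmetric erasure: at least one shortmer is missing
    if len(ranked) > K and ranked[K - 1][1] == ranked[K][1]:
        return None  # unresolvable tie on the K-th shortmer
    return frozenset(kmer for kmer, _ in ranked[:K])

print(reconstruct_letter([0, 0, 2, 5, 5, 7, 7, 7], K=4, t=1))  # {0, 2, 5, 7}
print(reconstruct_letter([0, 2, 5, 5], K=4, t=2))              # None (erasure)
```

Raising t reduces mix-ups caused by spurious k-mers at the cost of more erasures, which is exactly the trade-off noted above.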
Fig. 5.
Experiment analysis of our first large-scale dataset. This dataset includes 640 combinatorial sequences, with parameters N = 8, K = 4, and M = 8. (a) Payload count of sequence BC70, cycle 7, showing a tie where payload 1 (orange square) was incorrectly chosen over payload 3 (green square). (b) Payload count of sequence BC75, cycle 7, illustrating a case where only three payloads were recovered instead of four. (c) Payload count of sequence BC583, cycle 2, showing three payloads with counts exceeding 75, with the fourth having very few reads. (d) Subsampling analysis of the sequencing reads. The x-axis represents the average number of reads per sequence, and the y-axis shows the number of unique shortmers observed out of 4, averaged over all barcodes and all cycles. Error bars represent the standard deviation across 10 independent subsampling repeats for each read depth.
We next performed a subsampling analysis in which we analyzed the results for a random subset of the sequencing output. We see that with an average of 50 reads or fewer per sequence, we encounter similar erasure errors more frequently (see Fig. 5d).
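The subsampling procedure behind Figs. 2d, 3b and 5d can be sketched as follows, assuming the reads have already been grouped per sequence and reduced to lists of identified shortmer indices (an illustrative simplification of our analysis code).

```python
import random
from statistics import mean, stdev

def downsample_unique(reads: list[list[int]], depth: int, repeats: int = 10):
    """Randomly subsample `depth` reads and count the unique shortmers
    observed at one cycle; returns mean and sd over `repeats` draws."""
    counts = []
    for _ in range(repeats):
        sample = random.sample(reads, min(depth, len(reads)))
        counts.append(len({kmer for read in sample for kmer in read}))
    return mean(counts), stdev(counts)

# Toy data: 200 reads, each revealing one of the K = 4 shortmers of a letter.
reads = [[random.randrange(4)] for _ in range(200)]
for depth in (5, 10, 20, 50):
    print(depth, downsample_unique(reads, depth))
```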
Error-correcting codes for combinatorial DNA
Our analysis of previously published datasets and of our experimental PoC dataset (see Figs. 2, 3 and 5) shows a high occurrence of composite erasures, in which one (or more) of the shortmers composing a combinatorial letter is absent from the sequencing reads. Such an event is more likely to occur when the coverage depth is relatively low. To address these specific errors in combinatorial DNA, building on the construction proposed by Sabary et al.31, we propose an error-correction scheme, which we name the Combinatorial VT code, or Combi-VT code for short. A thorough analysis of the redundancy properties and a formal mathematical description of the Combi-VT code construction are provided in Sabary et al.31. In that work, the composite asymmetric erasure error model is studied from a coding-theoretic perspective, an extension of the Hamming distance is used as a distance metric for such codes, and explicit bounds on the redundancy and achievable information rate (information density) are derived. In particular, it is shown that, in a suitable parameter regime, the redundancy of the proposed construction is near-optimal, differing from the theoretical optimum by only a small additive term. The same work also introduces several generalizations of the basic construction that enable correction of multiple asymmetric erasures within a single combinatorial symbol. Generally speaking, and as demonstrated both in simulations and on real datasets in the following two sections, since this scheme is specifically designed to address composite asymmetric erasures, it can correct a greater number of such errors while using the same redundancy as the 2D RS code, assuming similar combinatorial alphabets.
In this scheme, sequences over the combinatorial DNA alphabet are modeled as binary matrices, where rows represent the combinatorial letters in their order within the sequence, and columns represent the building block shortmers. Each row contains exactly K ones, indicating the set of K building block shortmers that compose the combinatorial letter (see Fig. 6d). In this model, missing shortmers (composite erasures) are treated as asymmetric errors, where bits in the matrix are flipped from 1 to 0, but not vice versa (see Fig. 7b). The scheme leverages tensor-product (TP) codes36 to combine standard erasure and substitution-correcting codes (e.g., Reed-Solomon (RS) codes37) with Varshamov-Tenengolts (VT) codes38, which are designed to correct asymmetric errors. Using this construction, it is possible to correct typical error patterns involving missing shortmers and to reliably decode information stored in combinatorial DNA molecules.
Fig. 6.
Example of the encoding process for combinatorial sequences using VT syndromes and RS coding, with parameters N = 8, K = 4, M = 7, and p = 8. (a) The user’s binary information is first mapped into valid combinatorial letters, each represented as a column in the matrix and defined by a set of K = 4 shortmers (blue squares with value 1). (b) For each combinatorial letter, the VT syndrome is computed. The resulting sequence of syndromes forms a codeword in a Reed-Solomon (RS) error-correcting code, namely RS(7,5) over GF(8); this encoding can correct up to two erasures. (c) The RS-encoded VT syndromes are mapped into additional combinatorial letters and appended to the original information-carrying letters, forming the complete encoded matrix. (d) The final combinatorial sequence is obtained from the complete matrix and prepared for synthesis. For more details, analysis, and generalizations of this code construction, the reader is referred to Sabary et al.31.
Fig. 7.
Example of the decoding process for combinatorial sequences using VT syndromes and RS coding, with parameters N = 8, K = 4, M = 7, and p = 8. (a) Matrix representation of the received combinatorial sequence after sequencing. Missing shortmers, shown as red 0s, correspond to asymmetric errors (1s flipped to 0s). (b) For each combinatorial letter (matrix column), the VT syndrome is computed from the observed shortmers. Missing shortmers cause the corresponding syndrome values to be unknown (marked with “?”). The sequence of VT syndromes was designed to be a codeword in an RS error-correcting code capable of correcting two erasures. (c) RS decoding is applied to recover the missing VT syndromes, even when shortmers are absent. (d) Using each recovered VT syndrome together with the corresponding K − 1 observed shortmers in its column, the decoder identifies and completes the missing shortmers, and thus the matrix is recovered.
Our proposed error-correcting code is specifically tailored to handle combinatorial asymmetric errors and is based on the following idea. Since each combinatorial letter is a set of K k-mers, the code partitions all possible such sets according to a mathematical expression known as the Varshamov-Tenengolts (VT) syndrome. Each set of K k-mers can be mapped to a length-N binary vector with exactly K ones. The N entries of the vector correspond to the N building-block k-mers, and 1 and 0 correspond to the presence or absence (respectively) of a k-mer in the set. Thus, the VT syndrome of each set can be uniquely calculated using the following formula. Given a length-N binary vector x = (x1, …, xN) and a positive integer p, we define the VT syndrome over p38 of x, denoted by VTp(x), as follows:

VTp(x) = (∑_{i=1}^{N} i · xi) mod p.

Note that we have VTp(x) ∈ {0, 1, …, p − 1}. For the simplicity of our construction, we define p as the smallest prime power such that p ≥ N.
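The definition translates directly into code; the following is a minimal sketch, with p = 8 for N = 8 following the smallest-prime-power condition stated above.

```python
def vt_syndrome(x: list[int], p: int) -> int:
    """VT syndrome over p of a binary vector x = (x_1, ..., x_N):
    the index-weighted sum of the entries, taken modulo p."""
    return sum(i * xi for i, xi in enumerate(x, start=1)) % p

# N = 8, K = 4: the letter {1, 3, 4, 7} (1-indexed shortmer positions),
# with p = 8, the smallest prime power satisfying the condition above.
x = [1, 0, 1, 1, 0, 0, 1, 0]
print(vt_syndrome(x, 8))  # (1 + 3 + 4 + 7) mod 8 = 7
```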
As shown by Varshamov and Tenengolts38, this syndrome enables correction of any single asymmetric error. Thus, as depicted in Fig. 7c,d, if only K − 1 out of the K k-mers are observed for a combinatorial letter, it is possible to identify the missing k-mer by comparing the VT syndrome of the observed K − 1 k-mers to the VT syndrome of the original transmitted word, and thereby recover the full combinatorial letter.
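To make the recovery step concrete, the following sketch recovers a single missing shortmer from the stored syndrome and the K − 1 observed shortmer indices; the wrap-around handling of index N is our own inference from the choice of p, and Sabary et al.31 should be consulted for the exact construction.

```python
def recover_missing(observed: set[int], stored_syndrome: int, p: int) -> int:
    """Recover the 1-indexed position of the single missing shortmer of a
    combinatorial letter from its stored VT syndrome and the K - 1
    observed shortmer indices."""
    partial = sum(observed) % p
    missing = (stored_syndrome - partial) % p
    return missing if missing != 0 else p  # index N maps to residue 0 when p == N

# The letter {1, 3, 4, 7} has syndrome 7 (see above); shortmer 4 was missed.
print(recover_missing({1, 3, 7}, stored_syndrome=7, p=8))  # -> 4
```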
In our scheme, the VT syndromes of the combinatorial letters in each sequence are protected against errors. This is done by encoding the vector of VT syndromes using an RS code (see Fig. 6b). Thus, in our setting, it is possible to recover up to two VT syndromes per sequence, and then use each recovered syndrome to identify and correct the corresponding missing shortmer. The construction is flexible and can be extended to scenarios with a larger number of errors. In31, it is described how to adapt the construction to handle multiple missing shortmers in multiple combinatorial letters by increasing the redundancy of the RS code and using extensions of the VT syndromes, thereby enabling the correction of more than two erroneous letters in each sequence.
Encoding
Based on this idea, the encoding process begins by mapping most of the user’s binary information into valid combinatorial letters, each defined by a set of K shortmers (as described in Figs. 1 and 6a; note that in the latter, the matrix is shown as the transpose of the one in the former). For each such letter, our encoder calculates its VT syndrome (see Fig. 6b). This mapping associates each letter with a unique VT syndrome value and organizes the space into disjoint classes (known as cosets) according to the syndrome. The vector of VT syndromes is then encoded using an RS code, which provides protection against errors and allows the syndromes to be recovered if errors occur (see Fig. 6b). Next, the remaining information bits from the user, together with the RS-encoded VT syndromes, are mapped to additional combinatorial letters (see Fig. 6c). Finally, the complete combinatorial sequence is sent for synthesis (see Fig. 6d).
To support both structure and error correction, we define two types of encoding functions. The first maps a binary message directly to a combinatorial letter. The second maps a shorter binary message together with a VT syndrome, which will later allow the decoder to identify and correct missing shortmers. This two-stage design provides a compact representation while ensuring robustness to asymmetric errors. The full encoder of this construction is given in Algorithm 1 of Sabary et al.31.
Decoding
As described above, any shortmer that is missing because it was not identified or sequenced appears as an asymmetric error in the matrix representation of the encoded combinatorial sequence. For example, in Fig. 7a, two entries in the matrix with value 1 have been flipped to 0.
The decoding process operates in two stages. First, the VT syndromes of the combinatorial letters are computed (see Fig. 7b). Since the vector of VT syndromes is protected with an RS code, missing syndromes can be recovered even when shortmers are missing (see Fig. 7c). Second, the decoder uses the K − 1 observed shortmers and their associated VT syndrome to recover the missing shortmer, enabling complete and accurate decoding (see Fig. 7d). Once the matrix is fully recovered, the original binary information can be decoded through the predefined mapping. A full description of the decoding algorithm is provided in Sabary et al.31, and an implementation of this decoding is available in our code.
It is important to note that this construction is applicable only under specific code parameters. In particular, it requires the combinatorial letters to be approximately uniformly distributed across the VT syndrome classes. This uniformity ensures that all syndrome classes are populated and usable for encoding, which is essential for enabling decoding from partial observations. A mathematical description of this constraint, along with a proof of the construction, is provided in31. A schematic description of the encoding process is given in Fig. 8.
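For the example parameters used in Figs. 6 and 7, the occupancy of the syndrome classes can be inspected directly; the short sketch below counts how many of the 70 possible letters fall into each class (the formal uniformity requirement is given in Sabary et al.31).

```python
from collections import Counter
from itertools import combinations

N, K, p = 8, 4, 8  # example parameters; p as defined above

# Each letter is a K-subset of {1, ..., N}; its syndrome is the sum of its
# indices modulo p. Count the letters in each syndrome class (coset).
classes = Counter(sum(letter) % p for letter in combinations(range(1, N + 1), K))
print(sorted(classes.items()))
# All p classes are populated with roughly C(8,4)/8 ~ 9 letters each, so every
# syndrome value can be realized by an information-carrying letter.
```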
Fig. 8.

End-to-end workflow of a combinatorial DNA-based storage system. A binary message is broken into chunks of information bits; these chunks are barcoded and encoded into a combinatorial alphabet, with some of the information bits in each chunk designated for redundancy (i). In each chunk, VT syndrome calculations are performed on each combinatorial letter (ii). RS encoding is applied to the vector of VT syndromes of each chunk (iii). The designated information bits are paired with each RS-encoded syndrome in each chunk (the syndrome selects the coset and the information bits select the letter within it), and these pairs are then converted into combinatorial letters (iv). The combinatorial message is synthesized using combinatorial shortmer synthesis (v), and the DNA is sequenced (vi). Next, the combinatorial letters are reconstructed (vii), and the VT syndrome is computed for each combinatorial letter (viii). Using the RS code of each chunk, the syndromes are decoded (ix); the combinatorial letters and information bits are then recovered (x), followed by translating the information back into the original binary message (xi).
We note that a direct comparison between the proposed Combi-VT coding scheme and previously published DNA-based data storage coding schemes is not straightforward. First, combinatorial DNA relies on a fundamentally different data-representation paradigm compared to conventional base-by-base DNA synthesis. Second, the underlying error models differ substantially: while most prior works focus on substitution, insertion, and deletion errors at the nucleotide level, the present work addresses asymmetric erasures at the shortmer level, which arise naturally in combinatorial synthesis. For example, the DNA Fountain scheme8 assumes that a sufficient fraction of sequencing reads are effectively noise-free, whereas Antkowiak et al.18 proposed a Reed-Solomon-based approach designed to tolerate higher synthesis error rates in low-cost manufacturing settings. Chandak et al.17 introduced an LDPC-based coding framework integrated directly with the Nanopore basecaller, targeting a different error profile and system architecture. Combinatorial synthesis is a relatively recent approach to DNA data representation and exhibits a distinct and asymmetric error structure compared to prior synthesis technologies. Accordingly, we do not claim that the proposed coding scheme is universally superior to existing DNA storage codes. Rather, our objective is to design and evaluate a coding strategy that is well-matched to the dominant error mechanisms inherent to combinatorial DNA synthesis.
Comparison of the combinatorial VT code and the 2D RS code using simulations
We compared the error correction capabilities of the Combi-VT scheme and the 2D RS scheme using an end-to-end simulation, similar to the one used in Preuss et al.25. Briefly, 10 KB messages were encoded using a combinatorial alphabet with N = 8 and K = 4, corresponding to 6 bits per symbol. Each message was encoded using 2432 combinatorial sequences of length 7 (including EC redundancy symbols and sequences). The two inner EC codes used the same redundancy level of 6 redundancy bits for every 30 bits of information: the 2D RS EC used an RS(7,6) inner code in each sequence, and the Combi-VT code used an RS(7,5) code to encode the VT syndrome vector. The outer EC code used RS(32,30), adding two redundancy sequences for each 30-sequence block. A simulation of combinatorial synthesis, molecule sampling, and sequencing was then run on the combinatorial sequences using tunable parameters for insertion, deletion, and substitution error rates and for the sampling rate (see Methods, Simulation of a combinatorial DNA-based storage system). The results of the simulation runs are summarized in Fig. 9. Figure 9a depicts a schematic representation of the simulation workflow and indicates the reconstruction rates and Levenshtein distance calculations throughout the process. Each run included 30 repeats with random input texts of 10 KB. In Fig. 9b, an average of 20 reads per sequence were sampled, demonstrating the effects of both error correction codes. The results clearly show that the Combi-VT code achieves better reconstruction outcomes compared to the 2D RS method25, with significant average improvements of 0.012 (P-value = 3.82e-11, two-sided t-test) and 0.005 (P-value = 7.53e-8, two-sided t-test) in the normalized byte-level Levenshtein distance score after inner EC and after outer RS, respectively. Furthermore, as shown in Fig. 9c, when using 10, 15, 20, and 30 sampled reads per sequence, the new error correction code for the combinatorial encoding scheme31 consistently provides better results and more accurate data reconstruction. Notably, the sampling rate is a dominant factor affecting the reconstruction rate. Even with no synthesis and sequencing errors, low sampling rates yield poor results (Fig. 9c) that the error correction cannot overcome. The effect of substitution errors on overall performance is comparatively smaller, and such errors are easier to detect and correct. This is because substitution errors occur at the nucleotide level, while insertion and deletion errors occur at the k-mer level; the minimal Hamming distance of the trimer set used as building blocks allows for the correction of single-base substitutions.
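For reference, the evaluation metric used throughout this comparison can be computed with a standard dynamic program; normalizing by the length of the original message is our assumption of the convention (see Methods, Simulation Reconstruction, for the exact definition).

```python
def levenshtein(a: bytes, b: bytes) -> int:
    """Classic dynamic-programming edit distance over byte sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(decoded: bytes, original: bytes) -> float:
    """Byte-level Levenshtein distance normalized by the original length."""
    return levenshtein(decoded, original) / len(original)

print(normalized_distance(b"DNA Storage!", b"DNA Storage?"))  # 1/12 ~ 0.083
```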
Fig. 9.
Comparison of error correction schemes using end-to-end simulations. (a) A schematic view of the simulation workflow, including: encoding into a combinatorial message (1), inner error correction encoding (2), outer error correction encoding (3), DNA synthesis and sequencing simulation (4), combinatorial letter reconstruction (5), outer error correction decoding (6), inner error correction decoding (7), and decoding of the original text message (8). The Roman numerals (i–iv) represent the different performance evaluation steps. (b) Error rates observed at different stages of the decoding process. Boxplot (representing 30 simulations) of the normalized byte-level Levenshtein distance (see Methods, Simulation Reconstruction) at the different stages of the simulation. The simulation was run with a coverage of 20 reads per barcode and a substitution error rate of 0.01, for the Combi-VT code (grey) and the 2D RS EC (blue). (c) Sampling rate effect on overall performance. Average normalized byte-level Levenshtein distance (over 30 simulations) as a function of sampling rate, before EC decoding and after decoding (ii). Different lines represent different error types (substitution, deletion, and insertion) introduced at a rate of 0.01 and 0.
Comparison of the combinatorial VT code and the 2D RS using actual experimental data
To evaluate the effectiveness of our proposed error correction scheme, we conducted a second large-scale experiment using the combinatorial Bridge oligonucleotide assembly by Helixworks Technologies Ltd. The experiment compared the performance of the traditional 2D RS-based approach with our newly designed combinatorial error correction method. The experimental parameters were slightly modified to allow a fair comparison of the two approaches using identical redundancy levels. Briefly, we encoded and decoded a 2.66 KB input message using 640 combinatorial sequences, with M = 7 cycles, N = 8, and K = 4 (see Table 5). See Section First Experimental Proof of Concept 640 Sequences and 8 Cycles for more details.
Table 5.
Design parameters of the second large-scale experiment.
| Parameters | Value |
|---|---|
| Text size | 2.66 KB |
| Number of total BC | 640 |
| Number of BC for 2D RS | 320 |
| Number of BC for new EC | 320 |
| Number inner BC | 64 |
| Number outer BC | 10 |
| Sequence length/number of cycles (M) | 7 |
| Redundancy bits per sequence | 6 |
| Inner error correction code | RS(7,6) |
| Outer error correction code | RS(32,30) |
| Oligo length | 475 bp |
| Inner BC length | 75 bp |
| Outer BC, payload, and universal length | 25 bp |
| Number of unique k-mers (N) | 8 |
| Number of unique k-mers in each letter (K) | 4 |
| Logical density (bits/ Synthesis Cycle) | 6 |
| Alphabet size | C(8,4) = 70 |
The sequencing output contained 3,615,438 reads (see Table 6), of which 1,590,536 reads were longer than 450 bp (average = 420, median = 419, sd = 141, see Fig. 10a). The average number of reads per sequence over all 640 sequences is 621 (see Table 7 and Fig. 10c).
Table 6.
Summary of the counts of sequencing reads analyzed in the second large-scale experiment.
| Condition | Value |
|---|---|
| Number of reads | 3,615,438 |
| Number of reads with 450 bp+ | 1,590,536 |
| Number of reads with BC identified | 397,806 |
| Expected BC | 640 |
| Recovered BC | 640 |
| Recovered BC before error correction | 632 |
| BC not fully recovered | 8 |
Fig. 10.
Experiment analysis. (a) Read length distribution. (b) Start positions of all universal sequences within the reads. (c) Read coverage distribution, with the x-axis representing the number of reads and the y-axis representing the number of barcodes. (d) The x-axis is the average number of reads per sequence, and the y-axis is the number of observed unique shortmers out of 8, with N = 8, K = 4, M = 7, and 640 different sequences. (e) Comparison of decoding performance, with the x-axis showing the average reads per sequence and the y-axis representing the normalized byte-level Levenshtein distance. (f) Comparison of error correction performance, with the x-axis representing the number of unique shortmers observed and the y-axis showing the normalized byte-level Levenshtein distance.
Table 7.
Summary of the average number of reads per outer BC in the second large-scale experiment. Each outer BC was processed in a separate well.
| Outer BC | Inner BC | Average number of reads |
|---|---|---|
| 1 | 1–64 | 897 |
| 2 | 65–128 | 769 |
| 3 | 129–192 | 808 |
| 4 | 193–256 | 680 |
| 5 | 257–320 | 610 |
| 6 | 321–384 | 1103 |
| 7 | 385–448 | 212 |
| 8 | 449–512 | 257 |
| 9 | 513–576 | 292 |
| 10 | 577–640 | 584 |
Similar to the PoC experiment, the analyzed sequences contained gaps of various lengths with unidentifiable content, as illustrated in Fig. 10a,b. Therefore, in this experiment we used the same BLAST-based decoding algorithm described in Methods, BLAST-based analysis of large-scale combinatorial experiment. The large-scale nature of the experiment resulted in coverage variability, with some sequences receiving high coverage while others remained sparse (see Fig. 10c), directly impacting decoding accuracy.
To compare the two error-correction approaches, we split the 640 BCs in this experiment into two groups: BCs 1–320 were encoded using the traditional 2D RS-based method, whereas BCs 321–640 were encoded with our newly designed combinatorial error-correcting approach, serving as the inner code while sharing the same outer code as the former. In the combinatorial error-correcting scheme, we employed an RS(7,5) code to protect the VT syndromes. This enables the scheme to correct up to two missing shortmers (composite erasures) in each sequence. In the traditional 2D RS-based approach, we utilized an RS(7,6) code for inner error correction. Consequently, in both schemes, each sequence contains 36 bits of information and 6 bits of redundancy used for error correction, meaning that both schemes use the same redundancy and allowing us to focus our analysis solely on the error correction capabilities. Additionally, both methods incorporated an RS(32,30) outer code to enhance overall reliability. Thus, as each scheme had 300 sequences of information in total, the total encoded information was 10,640 bits per scheme. Our analysis showed that down-sampling leads to an increase in the prevalence of erasure errors (Fig. 10d), specifically in the critical region of low read coverage where the average read count per sequence is below 50. This region also exhibits an increased decoding failure rate, characterized by high Levenshtein distance (Fig. 10e). Zooming in on the region with a high erasure rate reveals that, although both schemes employ the same redundancy, the Combi-VT code exhibits a slight advantage over the traditional 2D RS (Fig. 10e, inset). When restricting the analysis to sequences in which erasure errors were detected (regardless of sequencing coverage), we see that the suggested error correction clearly outperforms the traditional approach (Fig. 10f). Nonetheless, the presence of other error types, such as substitutions and insertions, significantly impacts the decoding rate, sometimes shadowing the effect of the different error correction schemes. It is thus clear that any error correction scheme for combinatorial DNA-based data storage systems must take a holistic approach that can effectively address all possible error types.
Discussion
In this study, we investigated the occurrence and correction of erasure errors in combinatorial DNA-based data storage, focusing on their prevalence during down-sampling. In the analysis of two previously published datasets, these errors were shown to become more pronounced as sequencing depth decreased, a characteristic feature of combinatorial DNA-based data storage. While these erasure errors can result from underlying base-level insertions, deletions, or substitutions that cause certain shortmers to become unidentifiable, our analysis indicates that they also manifest as higher-level phenomena reflecting the nature of the combinatorial synthesis. We further validated these observations using a large-scale experimental proof-of-concept, confirming that erasure errors are a dominant issue, particularly under low sequencing coverage. To address these errors, we developed an asymmetric error-correcting code specifically designed to mitigate their impact. Our method consistently outperformed traditional symmetric error-correcting codes, including 2D Reed-Solomon (RS), especially in scenarios characterized by high erasure error rates.
In comparison to previous studies, which primarily focused on symmetric error-correcting codes or sequence-level corrections, our approach directly targets the asymmetric nature of errors in combinatorial DNA. K-mer-level error correction therefore serves as a complementary mechanism to base-level correction, addressing asymmetric errors inherent to combinatorial DNA representations. This is an important advancement, as it acknowledges the unique error profile of combinatorial DNA, where the reconstruction of specific combinatorial letters is compromised due to the loss of included k-mers. By tailoring our method to this problem, we achieved significant improvements in decoding accuracy. It is especially interesting to compare the method proposed in this work to those presented in previous large-scale combinatorial work. The most suitable for comparison is the combination of a fountain code and a forward error correction code used in10. The two approaches differ significantly, specifically in low-coverage regimes. While our proposed code focuses on the asymmetric erasure errors that dominate low-coverage scenarios, using a fountain code in such cases results in complete data loss, as was clearly demonstrated in Anavy et al.10. This further supports the use of codes specifically tailored to the expected error patterns.
Our analysis revealed that down-sampling consistently led to an increase in erasure errors, particularly when the average coverage dropped below 50 reads per sequence. This critical threshold was identified in both experiments, where low sequencing depth directly impacted decoding accuracy. Additionally, the large-scale nature of our experiments introduced coverage variability, with some sequences receiving high read coverage while others remained sparse. This variability directly impacted decoding performance, emphasizing the need for error correction methods that can adapt to uneven coverage distributions. To accurately identify k-mers within each read, even in cases where sequencing errors or gaps disrupted the expected sequence structure, we employed a BLAST-based approach for sequence analysis. This choice was driven by the design of the combinatorial sequences, where each letter is represented by a set of predefined k-mers. The BLAST-based method ensured robust detection of these k-mers, even under challenging conditions, making it essential for maintaining decoding accuracy, especially in low-coverage scenarios.
We observed that our new error correction method demonstrated superior performance over the 2D RS approach, both in simulations and in the large-scale experiment, particularly in scenarios dominated by erasure errors. This performance difference can be attributed to the targeted design of our method, which leverages tensor-product (TP) codes to integrate standard erasure and substitution-correcting codes (such as Reed-Solomon (RS) codes) with Varshamov-Tenengolts (VT) codes. This design effectively addresses the asymmetric nature of errors in combinatorial DNA, where missing k-mers directly impact decoding success. As seen in the second experiment, the new error correction method proved particularly effective in recovering sequences from pools with low read counts, where traditional 2D RS failed to fully decode the data.
However, our study primarily focused on erasure errors, with other error types (e.g., substitutions, insertions, and deletions) not explicitly addressed. Our analysis showed that despite the focus on erasure errors, substitutions and insertions also affected decoding accuracy, highlighting the need for a unified error correction scheme that accounts for all possible error types. This is particularly important given the variability in sequencing quality and coverage observed in our experiments.
Overall, our findings emphasize the importance of developing unique error correction schemes for combinatorial DNA storage, particularly when the sequencing depth is limited. The adaptability of our asymmetric error-correcting code makes it suitable for a wide range of applications, beyond the datasets we analyzed. Moreover, the principles demonstrated here can be extended to other DNA-based data storage systems facing similar asymmetric error challenges.
In conclusion, our study provides a detailed understanding of erasure errors in combinatorial DNA storage and offers a practical solution for their correction. This advancement represents a significant step toward improving the reliability of DNA-based data storage systems. Our findings also raise important questions regarding the optimal integration of traditional error-correction schemes with specialized approaches tailored to combinatorial DNA. Further research is needed to develop unified error correction strategies that address the full spectrum of errors encountered in large-scale DNA storage systems. Future studies could focus on optimizing the current asymmetric error-correcting code to further enhance performance, while also exploring the impact of other error types (such as substitutions, insertions, and deletions) on decoding accuracy in combinatorial DNA. Moreover, developing alternative methods for k-mer detection beyond BLAST, capable of faster and more scalable analysis of large-scale combinatorial datasets, could significantly enhance decoding efficiency. Finally, optimizing the design of combinatorial sequences to improve k-mer detection and minimize sequencing errors could further boost system robustness. From a broader perspective, as the technology matures, it will be important to establish more general evaluation metrics for DNA data storage systems. Such metrics would enable meaningful comparisons between combinatorial DNA synthesis and base-level synthesis approaches, which currently operate under fundamentally different data representations and error models. These metrics should account not only for error-correction performance, but also for system-level considerations such as writing time, cost, required sequencing coverage, and achievable information density.
Methods
BLAST-based analysis of the data from Yan et al.27
The 102,222 reads from the sequencing results were analyzed using the following procedure.
Read length filtering. Reads were filtered to retain only those with length greater than a predefined threshold of 50 nt. This resulted in 22,477 reads.
Sequence structure extraction. The table of the BLAST results was analyzed. For each read, the sequences directly to the left and right of the identified U1 sequence were compared to the barcode sequences and the payload sequences, respectively. The comparison was based on Levenshtein distance, and the nearest-neighbor sequence was selected as long as the distance d was below a predefined threshold.
A spreadsheet with the results of the BLAST-based analysis is available as Supplementary BLAST Analysis output of Yan et al. data.
BLAST-based analysis of large-scale combinatorial experiment
The large-scale experimental data was analyzed using an extension of the analysis pipeline presented in BLAST-based analysis of the data from Yan et al.27, with the following adjustments. A total of 3,855,223 reads were analyzed.
Although BLAST is generally considered a computationally expensive algorithm, in our case the “reference genome” consisted only of the universal parts and the set of building-block shortmers. As a result, the effective genome length was very short (less than 600 bp), making the BLAST runtime negligible. Nevertheless, we acknowledge that further optimization could improve decoding performance, and exploring high-performance alignment and decoding methods remains an important direction for future work.
Read length filtering. Reads were filtered to retain only those with length greater than a predefined threshold of 500 nt. This resulted in 2,136,524 reads.
BLAST alignment and positional extraction. BLAST was used with the same parameters, while the universal-sequence database was altered to include all universal sequences in the large design. The output contains, for each read, the positions of the universal sequences.
Extract universal positions in read. For each read in the BLAST output, we derive a list of universal-sequence positions by analyzing the alignment coordinates reported by BLAST for each universal sequence. First, the best alignment per universal sequence is selected based on the bit score. The list of positions is then passed through a correction function that ensures consistent spacing between consecutive universals, using the designed distances from the original barcode layout. That is, each universal sequence must follow the previous one at a valid relative distance. The positions of universal sequences with no matches are inferred from the previous sequence. This allows a read to be analyzed even when some universal sequences are missing. To demonstrate the importance of this correction: while only 32,705 reads contained all 11 universal sequences, after applying the correction all 8 payload sequences were identified in as many as 154,071 reads.
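The spacing-correction step can be summarized by the following minimal Python sketch. This is an illustration only, not the pipeline code: the names `best_hits` (mapping each universal index to its best BLAST start position, or None if unmatched), `designed_offsets` (the designed distances between consecutive universals), and the tolerance value are all assumptions.

```python
def correct_universal_positions(best_hits, designed_offsets, tol=5):
    """Infer a consistent list of universal-sequence positions along a read.

    best_hits: dict mapping universal index (0-based) -> best BLAST start
               position in the read, or None if the universal was not found.
    designed_offsets: designed distance (nt) from universal i to universal i+1.
    tol: allowed deviation from the designed spacing (illustrative value).
    """
    n = len(designed_offsets) + 1
    positions = [None] * n
    positions[0] = best_hits.get(0)
    for i in range(1, n):
        expected = None
        if positions[i - 1] is not None:
            expected = positions[i - 1] + designed_offsets[i - 1]
        hit = best_hits.get(i)
        if hit is not None and (expected is None or abs(hit - expected) <= tol):
            positions[i] = hit       # observed hit agrees with the design
        else:
            positions[i] = expected  # fall back to the designed spacing
    return positions
```

In this way, a universal sequence lost to a sequencing error still receives an inferred position, so the downstream payload windows remain anchored.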
Barcode identification. Once the positions of the universal sequences are extracted, we estimate the barcode’s location relative to the second universal sequence (U2). Specifically, we anchor the read by subtracting the barcode length from the position of U2, and then shift forward by the expected fixed offset: the length of universal 1 and a 5’ linker region (see Fig. 4a). At this computed position, we attempt to match the barcode against a known inner barcode set, allowing a Levenshtein distance of up to 6. If a valid inner barcode is found, we then check for a matching outer barcode in the prefix of the read, with a Levenshtein distance of up to 4. Only if both inner and outer barcodes are successfully identified is the barcode index assigned to the read; otherwise, the read is excluded from downstream analysis.
Payload identification. After the barcode is identified, we locate the payload sequences using the positions of the internal universal sequences. Specifically, we focus on the region between U3 and U11, which frames the payloads in our design. For each expected payload position, we compute the start location by shifting the universal position forward by the known universal length. At each position, we compare the sequence in the read to the expected payload set, using a Levenshtein distance of up to 6. If a match is found, the corresponding payload index is recorded. If no match is found, or if the read is too short, the corresponding payload position is marked as undetected.
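For concreteness, the nearest-neighbor matching used for both barcode and payload identification can be sketched as follows. This is a schematic reimplementation under our own naming, not the pipeline code; `max_dist` mirrors the thresholds quoted above (6 for payloads and inner barcodes, 4 for outer barcodes).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_sequence(window, candidates, max_dist=6):
    """Return the index of the nearest candidate within max_dist, else None."""
    best_idx, best_d = None, max_dist + 1
    for idx, cand in enumerate(candidates):
        d = levenshtein(window, cand)
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx if best_d <= max_dist else None
```

A window that matches no candidate within the threshold is reported as undetected, exactly as described above.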
- Reconstruction and decoding. Once barcodes and payloads have been identified, we reconstruct the encoded message by aggregating and decoding reads according to their barcode group as follows:
- Barcode grouping. Reads are grouped based on their identified barcode, such that each group corresponds to one unique combinatorial sequence.
- Payload aggregation per position. For each barcode group and each payload position, we collect the payloads observed at that position across all reads. This yields a distribution of payloads for the position, indicating which payloads are most likely even in the presence of sequencing errors. From the observed distribution, we select the K most frequent payloads, where the parameter K is determined by the size of the combinatorial alphabet. If a tie occurs when selecting the top K payloads, one of the candidates is selected arbitrarily. If fewer than K payloads are identified, we count the number of missing payloads as asymmetric erasure errors (see the sketch after this list).
- Combinatorial letter assignment. The selected K payloads are interpreted as a single combinatorial letter from the predefined alphabet. If no match exists, the letter is marked as an erasure.
- Error-correcting. The full reconstructed sequence of combinatorial letters is then decoded using a two-dimensional Reed-Solomon (2D RS) error-correcting code.
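The per-position payload aggregation can be sketched in a few lines of Python. This is a minimal illustration, assuming per-read payload detection results are already available; the function name `reconstruct_letter` is ours.

```python
from collections import Counter

def reconstruct_letter(observed_payloads, K):
    """Select the K most frequent payload indices at one position.

    observed_payloads: payload indices collected from all reads of a barcode
                       group at this position (None = undetected in a read).
    Returns (selected, n_erasures): the chosen payload set and the number of
    missing payloads, counted as asymmetric erasure errors.
    """
    counts = Counter(p for p in observed_payloads if p is not None)
    # most_common breaks ties arbitrarily, matching the pipeline description
    selected = [p for p, _ in counts.most_common(K)]
    n_erasures = K - len(selected)
    return selected, n_erasures
```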
Large-scale experiment comparing error-correction approaches
The large-scale experimental data was analyzed using the same analysis pipeline presented in BLAST-based analysis of large-scale combinatorial experiment, with the following minor differences:
Each sequence contained only 7 payload sequences; therefore, the analysis focused on the region between U3 and U10.
The decoding of the error correction code was applied according to the experimental setting. The first 320 sequences were decoded using 2D RS while sequences 321–640 were decoded using the new combinatorial error-correction code.
Simulation of a combinatorial DNA-based storage system
Synthesis and sequencing simulation with errors
- Simulation of the synthesis process. DNA molecules corresponding to the designed sequences are generated using combinatorial k-mer DNA synthesis (see Fig. 1):
- For each combinatorial sequence, determine the number of synthesized copies by sampling from a predefined copy-number distribution. Let c be the number of copies for a specific sequence.
- For every position in the sequence, uniformly sample c independent k-mers from the set of member k-mers of the combinatorial letter at that position.
- Concatenate the sampled k-mers to the corresponding position of each of the c partial molecules, gradually building the full synthesized sequences.
- Error simulation. We model synthesis and sequencing errors as follows (see the sketch following this list):
- The probabilities of deletion, insertion, and substitution are specified by dedicated parameters, denoted here p_del, p_ins, and p_sub, respectively.
- Deletion and insertion errors are considered synthesis-related and are applied at the level of entire k-mers, meaning a full k-mer may be deleted or inserted at a given position during the simulation.
- Substitution errors are modeled as sequencing errors and occur at the single-base level: a single base is replaced by another, independently of its position within a k-mer.
- Mixing. After synthesis, the molecules are randomly mixed to reflect natural molecular combinations. This is implemented by a randomized data-line shuffle using an SQLite-based method, which allows efficient shuffling even for large-scale datasets39.
- Reading and sampling. To simulate the sequencing process, a subset of the synthesized molecules is sampled. The number of sampled reads is a parameter of the simulation, representing the expected sequencing depth.
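The synthesis and error steps above can be illustrated with the following schematic Python sketch. It is a simplified stand-in for the actual simulator: the copy-number distribution is sampled upstream, and the parameter names (p_del, p_ins, p_sub) and the spurious-insertion model are assumptions for illustration.

```python
import random

def synthesize(letters, copies):
    """Simulate combinatorial k-mer synthesis of one designed sequence.

    letters: list of combinatorial letters, each a list of member k-mers.
    copies:  number of molecule copies to build (sampled upstream).
    """
    molecules = [""] * copies
    for kmers in letters:
        for m in range(copies):
            # each copy independently receives one member k-mer per position
            molecules[m] += random.choice(kmers)
    return molecules

def add_errors(molecule, k, kmers_pool, p_del, p_ins, p_sub, bases="ACGT"):
    """k-mer-level deletions/insertions (synthesis) + base-level substitutions."""
    out = []
    for i in range(0, len(molecule), k):
        if random.random() < p_del:
            continue                               # drop an entire k-mer
        if random.random() < p_ins:
            out.append(random.choice(kmers_pool))  # insert a spurious k-mer
        out.append(molecule[i:i + k])
    seq = "".join(out)
    # substitutions act per base, independently of k-mer boundaries
    return "".join(random.choice(bases) if random.random() < p_sub else b
                   for b in seq)
```

Sampling a subset of the resulting molecules then emulates the chosen sequencing depth.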
Reconstruction
The reconstruction and barcode decoding from sequencing reads was carried out using the following steps:
Barcode decoding. Decode the barcode sequence of each read using a Reed-Solomon (RS) error-correcting code.
Grouping by barcode. Group the sequencing reads by their decoded barcode sequences, enabling the reconstruction of the combinatorial sequences.
Filtering of read groups. Exclude barcode groups with fewer than 10% of the expected number of sampled reads.
- Combinatorial reconstruction. For each set of reads grouped by the same barcode:
- Traverse the sequence position by position.
- At each position, identify the K most common k-mers across the reads.
- Use these frequent k-mers to determine the corresponding combinatorial letter.
- Calculate the length deviation between each read and the expected sequence length, and exclude reads whose deviation exceeds a predefined threshold, as they likely contain synthesis or sequencing anomalies.
- Replace any k-mers that are not valid members of the combinatorial alphabet with an erasure symbol.
Handling missing barcodes. Replace missing barcodes with erasure sequences to enable proper outer RS decoding.
- Normalized byte-level Levenshtein distance calculation.
- Compute the Levenshtein distance between the observed sequence O and the expected sequence E.
- Normalize this distance by dividing it by the length of the expected sequence: d_norm(O, E) = d_L(O, E) / |E|, where d_L denotes the Levenshtein distance.
Code construction
Definitions and notations
For a positive integer n, let [n] = {1, ..., n}. For a binary vector x, wt(x) denotes the Hamming weight (shortly, the weight) of x, which is the number of ones in x.
Let Σ = {A, C, G, T} be the DNA alphabet and let k be the k-mer (shortmer) length. We let Ω = {o_1, ..., o_n} ⊆ Σ^k be a set of n shortmers, indexed lexicographically. For 1 ≤ K ≤ n, we define the K-combinatorial composite alphabet of Ω as the set of combinatorial composite symbols, where each symbol is a set of K different shortmers chosen from the shortmer set Ω. For simplicity, the symbols of this alphabet can be abstracted as length-n binary vectors of weight K, in which every bit indicates whether the corresponding shortmer in Ω belongs to the set. Thus, every composite symbol σ is mapped to its indicator binary vector, denoted x_σ, and note that wt(x_σ) = K. In the code construction below, we refer to the composite symbols in our alphabet by their binary vector representation, and denote the set of length-n binary vectors of weight K by S(n, K).
A sequence of length M over a composite alphabet is denoted by σ = (σ_1, ..., σ_M). This sequence can be abstracted as an M × n binary matrix X, in which each row corresponds to its composite symbol.
For our construction, we consider the combinatorial composite-DNA channel, which receives an M × n matrix X and outputs a noisy version of X, denoted by Y. Similarly, we denote the rows of the matrix Y by y_1, ..., y_M, such that y_i is a noisy version of x_i.
Lastly, since the exact shortmers in the set Ω do not matter, but only their number, we refer to the combinatorial composite alphabet from now on by the set S(n, K), and a length-M sequence is simply an M × n matrix in B(M, n, K), where B(M, n, K) denotes the set of all M × n binary matrices in which the weight of every row is K.
Definition 1
Composite asymmetric errors. For a positive integer e and a row vector x ∈ S(n, K), we say that the corresponding channel output y suffers from e composite-asymmetric errors if wt(y) = K − e and y_j ≤ x_j for every j ∈ [n].
Definition 1 can be extended to matrices (sequences) as described below.
Definition 2
(t, e)-composite asymmetric errors. For positive integers e and t and a matrix X ∈ B(M, n, K), we say that the channel output matrix Y suffers from (t, e)-composite asymmetric errors if at most t rows of Y are noisy, each of them suffering from at most e composite-asymmetric errors.
A length-m (n, w)-composite code C is a set of m × n matrices over S(n, w), and every codeword in C is referred to as a composite codeword. We say that a length-m (n, w)-composite code is a (t, e)-composite-asymmetric ECC (in short, a (t, e)-CAECC) if it can correct any (t, e)-composite asymmetric error. Such a code will be referred to as an [m, (n, w); t, e]-composite code.
We denote by A(m, n, w) the size of the set of all binary matrices of dimension m × n in which each row is of weight exactly w, that is, A(m, n, w) = (n choose w)^m. We denote by A(m, n, w; t, e) the size of the largest [m, (n, w); t, e]-composite code. For a composite code C, we define its redundancy to be r(C) = log2 A(m, n, w) − log2 |C|. Furthermore, we denote by r(m, n, w; t, e) the minimum redundancy of such a composite code.
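As a worked example of these definitions, the raw information content log2 A(m, n, w) = m · log2 (n choose w), and the redundancy of a code of a given size, can be computed directly; the function names below are ours.

```python
from math import comb, log2

def uncoded_bits(m, n, w):
    """log2 A(m, n, w) = m * log2(C(n, w)): bits carried by m weight-w rows."""
    return m * log2(comb(n, w))

def redundancy_bits(m, n, w, code_size):
    """r(C) = log2 A(m, n, w) - log2 |C|, following the definition above."""
    return uncoded_bits(m, n, w) - log2(code_size)

# Example (illustrative parameters): n = 16 shortmers, weight w = 5,
# m = 8 letters per sequence gives log2 A = 8 * log2(4368) ~ 96.7 bits.
```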
Mathematical definition of the code
To define our code construction, we first define some parameters of the VT code38. Given a length-N binary vector x and positive integers s and p, we define the s-VT-syndrome over Z_p of x, denoted VT_s(x), as follows:
VT_s(x) = Σ_{i=1}^{N} i^s · x_i (mod p).
Note that VT_s(x) ∈ Z_p. For the simplicity of our construction, we assume that p is the smallest prime such that p > N. According to Bertrand’s postulate, it is known that p < 2N.
Furthermore, for an integer e ≥ 1, we also define the e-complete-VT-syndrome over Z_p, denoted CVT_e(x), to be CVT_e(x) = (VT_1(x), VT_2(x), ..., VT_e(x)), and note that CVT_e(x) ∈ Z_p^e. This definition can be extended to matrices by their rows as follows. Given a matrix X whose rows are x_1, ..., x_m, define the e-complete-VT-syndrome-vector over Z_p of X, denoted CVT_e(X), to be the vector whose i-th entry is the e-complete-VT-syndrome of the i-th row of X. That is, CVT_e(X) = (CVT_e(x_1), ..., CVT_e(x_m)).
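The syndromes defined above are straightforward to compute. The following self-contained sketch (with illustrative function names) computes VT_s and the e-complete-VT-syndrome for a binary row:

```python
def smallest_prime_above(n):
    """Smallest prime p > n (Bertrand's postulate guarantees p < 2n)."""
    def is_prime(q):
        return q > 1 and all(q % d for d in range(2, int(q ** 0.5) + 1))
    p = n + 1
    while not is_prime(p):
        p += 1
    return p

def vt_syndrome(x, s, p):
    """VT_s(x) = sum_i i^s * x_i (mod p), with 1-based positions i."""
    return sum((i ** s) * xi for i, xi in enumerate(x, 1)) % p

def complete_vt_syndrome(x, e, p):
    """e-complete-VT-syndrome: (VT_1(x), ..., VT_e(x))."""
    return tuple(vt_syndrome(x, s, p) for s in range(1, e + 1))
```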
Using the above definitions, the code construction is given below.
Code construction
To construct a composite-asymmetric error-correcting code, we first define the following parameters. Let e be a positive integer, and let p be the smallest prime number such that p > n. We consider an [m, m − t] code C_out over a field of size p^e, capable of correcting t erasures. When m ≤ p^e + 1, the code C_out can be selected as a maximum distance separable (MDS) code with dimension m − t.
Using these ingredients, we define the composite code as follows:
Construction 1
Let e ≥ 1, let p be the smallest prime number such that p > n, and let C_out be an [m, m − t] MDS code over a field of size p^e. The composite code is defined by
C = { X ∈ B(m, n, K) : CVT_e(X) ∈ C_out },
where CVT_e(X) denotes the e-complete-VT-syndrome-vector of the matrix X, and each row’s e-complete-VT-syndrome, an element of Z_p^e, is identified with a symbol of the field of size p^e.
Theorem 1
The code C of Construction 1 is a (t, e)-composite-asymmetric error-correcting code ((t, e)-CAECC).
Proof
Consider an m × n matrix Y, obtained from a codeword X ∈ C, where Y suffers from (t, e)-composite-asymmetric errors. Denote by i_1, ..., i_t the indices of the erroneous rows of Y, where each erroneous row has exactly K − e ones instead of K (assuming, without loss of generality, the maximal number of errors per row). Note that the erroneous rows are readily identified, since their weight is smaller than K; hence, their entries in the syndrome vector CVT_e(Y) can be treated as erasures. Without loss of generality, we prove the correction for the first erroneous row, y_{i_1}; the same argument applies to the others.
Since C_out is capable of correcting t erasures, we can use its decoder to recover the correct e-complete-VT-syndrome of the i_1-th row, CVT_e(x_{i_1}). Consequently, we can also recover the individual s-VT-syndromes VT_s(x_{i_1}) for every s ∈ [e].
Let J = {j_1, ..., j_e} be the set of indices corresponding to the positions in which y_{i_1} differs from x_{i_1}, i.e., the locations of the asymmetric errors. By construction, the following relation holds for every s ∈ [e]:
VT_s(x_{i_1}) − VT_s(y_{i_1}) ≡ Σ_{j ∈ J} j^s (mod p).
Therefore, for each erroneous row, we obtain e such equations. As established in Theorem 1 of40, these equations can be expressed in terms of the elementary symmetric polynomials of the error locations, leading to a polynomial whose roots are precisely the error positions j_1, ..., j_e.
The coefficients of this polynomial can be determined using the Newton–Girard identities41, and its roots are uniquely determined over Z_p by Lagrange interpolation42. Thus, the set of error positions can be efficiently determined, and the erroneous row can be corrected by simply flipping the bits at these positions. Applying the same process to all t erroneous rows completes the decoding, which proves the theorem. □
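To make the decoding argument concrete, consider the simplest case e = 1: a single omitted shortmer per erroneous row. The first-order syndrome difference then directly equals the error position, since positions lie in [1, n] and p > n. The sketch below illustrates this; the function name is ours, and `true_syndrome` stands for the VT_1 value recovered by the erasure decoder.

```python
def correct_single_omission(y, true_syndrome, p):
    """Recover a row that lost exactly one 1 (one omitted shortmer).

    y: received binary row (list of 0/1) with one 1 flipped to 0.
    true_syndrome: VT_1 of the original row, recovered via the erasure code.
    """
    observed = sum(i * yi for i, yi in enumerate(y, 1)) % p
    j = (true_syndrome - observed) % p  # position of the omitted shortmer
    x = list(y)
    x[j - 1] = 1                        # flip the bit back to one
    return x

# Example: x = [0,1,0,1,1] with an omission at position 4 -> y = [0,1,0,0,1].
# With p = 7 (> n = 5): VT_1(x) = (2+4+5) % 7 = 4, VT_1(y) = (2+5) % 7 = 0,
# and (4 - 0) % 7 = 4 recovers the omitted position.
```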
Molecular biology experimental protocols
Synthesis and sequencing
Synthesis and sequencing of the datasets were both performed by Helixworks Technologies, Ltd., Rubicon Centre, MTU Campus, Bishopstown, Cork, Ireland, T12 Y275.
Protocol for large scale proof of concept experiment
Each plate of 64 oligonucleotides was barcoded using the Oxford Nanopore Technologies (ONT) Native Barcoding Kit 96 V14 (SQK-NBD114.96), and pooled into a single vial. The prepared vials were ready for the adapter-ligation and clean-up steps, as outlined in ONT’s Ligation Sequencing Amplicon protocol (SQK-NBD114.96).
Acknowledgements
This project has received funding from the European Union (DiDAX, 101115134). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. The authors thank Zohar Yakhini’s research group and Eitan Yaakobi’s research group for useful discussions. Lastly, the authors thank Nimesh Pinnamaneni from HelixWorks Technologies for useful discussions and for his help in conducting this project. Preliminary results related to the constructions presented in the Code construction section were presented at the IEEE International Symposium on Information Theory (ISIT) 202431.
Author contributions
All authors designed the study. I.P. and O.S. implemented coding, simulations, and experimental protocols. I.P. and L.A. performed the data analysis. O.S. and R.G. led the formal mathematical investigation. Z.Y., E.Y., and L.A. supervised the study. I.P., O.S., and L.A. led the manuscript writing with contributions from all authors.
Data availability
The datasets generated and/or analyzed during the current study are available in the European Nucleotide Archive (ENA) repository. The dataset associated with Section Experimental Proof of Concept is available under Accession Number PRJEB89959, and the dataset supporting Section Experimental Comparison of the Suggested Code to 2D RS is available under Accession Number PRJEB89961.
Code availability
Simulation of the errors and retrieval pipelines: Combi-VT code: GitHub link; 2D RS code: GitHub link. Experiment design, including encoder and decoder: first large-scale experiment (encoded with 2D RS): GitHub link; second large-scale experiment (encoded with both 2D RS and the Combi-VT code): GitHub link. Data analysis of previously published datasets: Yan et al.27 experiment: GitHub link; Preuss et al.25: GitHub link.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Inbal Preuss and Omer Sabary.
Contributor Information
Inbal Preuss, Email: inbalpreuss@gmail.com.
Omer Sabary, Email: omer.orange@gmail.com.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-38599-0.
References
- 1.Taylor, P. Amount of data created, consumed, and stored 2010-2020, with forecasts to 2028. Statista. https://www.statista.com/statistics/871513/worldwide-data-created/(accessed on 24 October 2023) (2024).
- 2.Coughlin, T. 175 Zettabytes by 2025. Forbes. Available online: https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/ (2018).
- 3.Reinsel, D., Gantz, J. & Rydning, J. The Digitization of the World - From Edge to Core. An International Data Corporation (IDC) White Paper (Seagate) (2018).
- 4.Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
- 5.Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie Int. Edn. 54, 2552–2555 (2015).
- 6.Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
- 7.Yazdi, S. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
- 8.Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
- 9.Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
- 10.Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).
- 11.Bar-Lev, D., Sabary, O. & Yaakobi, E. The zettabyte era is in our DNA. Nat. Comput. Sci. 1–5 (2024).
- 12.Sabary, O., Kiah, H. M., Siegel, P. H. & Yaakobi, E. Survey for a decade of coding for DNA storage. IEEE Trans. Mol. Biol. Multi-Scale Commun. (2024).
- 13.Bar-Lev, D., Orr, I., Sabary, O., Etzion, T. & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. Nat. Machine Intell. 1–11 (2025).
- 14.Wu, J. et al. Stable DNA storage encoding scheme based on repeating substring tree. IEEE Trans. Comput. Biol. Bioinform. 22, 2184–2193. 10.1109/TCBBIO.2025.3586008 (2025).
- 15.Cao, B. et al. PELMI: Realize robust DNA image storage under general errors via parity encoding and local mean iteration. Brief. Bioinform. 25, bbae463. 10.1093/bib/bbae463 (2024).
- 16.Liu, Z. et al. Family of mutually uncorrelated codes for DNA storage address design. IEEE Trans. NanoBiosci. (2025).
- 17.Chandak, S. et al. Improved read/write cost tradeoff in DNA-based data storage using LDPC codes. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 147–156 (IEEE, 2019).
- 18.Antkowiak, P. L. et al. Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction. Nat. Commun. 11, 5345 (2020).
- 19.LeProust, E. M. et al. Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010).
- 20.Choi, Y. et al. High information capacity DNA-based data storage with augmented encoding characters using degenerate bases. Sci. Rep. 9, 6582 (2019).
- 21.Walter, F., Sabary, O., Wachter-Zeh, A. & Yaakobi, E. Coding for composite DNA to correct substitutions, strand losses, and deletions. In 2024 IEEE International Symposium on Information Theory (ISIT), 97–102 (IEEE, 2024).
- 22.Cohen, T. & Yaakobi, E. Optimizing the decoding probability and coverage ratio of composite DNA. In 2024 IEEE International Symposium on Information Theory (ISIT), 1949–1954 (IEEE, 2024).
- 23.Kobovich, A., Yaakobi, E. & Weinberger, N. M-DAB: An input-distribution optimization algorithm for composite DNA storage by the multinomial channel. arXiv preprint arXiv:2309.17193 (2023).
- 24.Kobovich, A., Yaakobi, E. & Weinberger, N. DeepDIVE: Optimizing input-constrained distributions for composite DNA storage via multinomial channel. arXiv preprint arXiv:2501.15172 (2025).
- 25.Preuss, I., Rosenberg, M., Yakhini, Z. & Anavy, L. Efficient DNA-based data storage using shortmer combinatorial encoding. Sci. Rep. 14, 7731 (2024).
- 26.Roquet, N. et al. DNA-based data storage via combinatorial assembly. bioRxiv 2021–04 (2021).
- 27.Yan, Y., Pinnamaneni, N., Chalapati, S., Crosbie, C. & Appuswamy, R. Scaling logical density of DNA storage with enzymatically-ligated composite motifs. Sci. Rep. 13, 15978 (2023).
- 28.Bar-Lev, D., Sabary, O., Gabrys, R. & Yaakobi, E. Cover your bases: How to minimize the sequencing coverage in DNA storage systems. IEEE Trans. Inform. Theory (2024).
- 29.Abraham, H., Gabrys, R. & Yaakobi, E. Covering all bases: The next inning in DNA sequencing efficiency. In 2024 IEEE International Symposium on Information Theory (ISIT), 464–469 (IEEE, 2024).
- 30.Preuss, I., Galili, B., Yakhini, Z. & Anavy, L. Sequencing coverage analysis for combinatorial DNA-based storage systems. IEEE Trans. Mol. Biol. Multi-Scale Commun. (2024).
- 31.Sabary, O. et al. Error-correcting codes for combinatorial composite DNA. In 2024 IEEE International Symposium on Information Theory (ISIT), 109–114 (IEEE, 2024).
- 32.Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
- 33.Gruica, A., Bar-Lev, D., Ravagnani, A. & Yaakobi, E. A combinatorial perspective on random access efficiency for DNA storage. In 2024 IEEE International Symposium on Information Theory (ISIT), 675–680 (IEEE, 2024).
- 34.Gruica, A., Montanucci, M. & Zullo, F. The geometry of codes for random access in DNA storage. arXiv preprint arXiv:2411.08924 (2024).
- 35.Sokolovskii, R., Agarwal, P., Croquevielle, L. A., Zhou, Z. & Heinis, T. Coding over coupon collector channels for combinatorial motif-based DNA storage. arXiv preprint arXiv:2406.04141 (2024).
- 36.Wolf, J. K. An introduction to tensor product codes and applications to digital storage systems. In 2006 IEEE Information Theory Workshop (ITW’06 Chengdu), 6–10 (IEEE, 2006).
- 37.Reed, I. S. & Solomon, G. Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8, 300–304 (1960).
- 38.Varshamov, R. R. & Tenenholtz, G. A code for correcting a single asymmetric error. Automatica i Telemekhanika 26, 288–292 (1965).
- 39.Hipp, R. D. SQLite. https://www.sqlite.org/index.html (2020). Accessed: January 26, 2025.
- 40.Dolecek, L. Towards longer lifetime of emerging memory technologies using number theory. In IEEE Globecom Workshops, 1936–1940 (Miami, FL, USA, 2010).
- 41.Seroul, R. & O’Shea, D. Programming for Mathematicians (Springer, 2000).
- 42.Nathanson, M. B. Additive Number Theory: The Classical Bases (Springer, 2010).