Abstract
Base substitution, insertion, and deletion errors arise from inherent technical constraints that induce unavoidable sequencing inaccuracies, limiting access to high-quality raw data and biological knowledge. To address this, we propose a deep sequence reconstruction model based on a multi-scale attention mechanism and contrastive learning (MACL), designed to enhance DNA sequence reconstruction under high error rate conditions. The multi-scale attention mechanism covers the base scale as well as the inter-sequence and intra-sequence scales. First, the MSA Transformer fully extracts both global and local base-scale features along the row and column dimensions. Furthermore, to handle errors between sequences and substitution errors within sequences, MACL introduces Inter-Sequence and Intra-Sequence Multi-Head Attention Mechanisms, respectively, and handles insertion and deletion errors through a convolution module. To maximize the consistency of positive sample pairs in the representation space, we introduce contrastive learning and design a negative sample construction method and data augmentation better suited to substitution errors in sequencing channels. Experiments on real-world DNA storage and viral genome datasets demonstrate that MACL significantly outperforms existing methods in DNA sequence reconstruction. In particular, when combined with RS codes, MACL can losslessly reconstruct medical images from highly biased sequences (base error rate = 5%) in DNA storage. In summary, MACL introduces a novel approach to DNA sequence reconstruction under high error rate conditions, laying a solid foundation for practical applications in DNA storage and genomics research.
Keywords: DNA sequence reconstruction, DNA storage, Noisy sequencing channels, Multi-scale attention, Contrastive learning
1. Introduction
DNA is the primary carrier of genetic information in living organisms. Analyzing DNA sequences reveals gene functions and regulatory mechanisms, advancing fields such as disease research and evolutionary analysis [1], [2], making precise reconstruction of DNA sequences essential [3]. DNA sequence reconstruction is the process of reassembling short DNA fragments (i.e., reads) obtained through sequencing into complete DNA sequences [4]. DNA sequence reconstruction applies not only to natural DNA sequences but also to synthetic ones. The representative task for natural sequences is general-purpose genome assembly [5], [6], [7], while for synthetic sequences it is DNA storage [8], [9].
Additionally, DNA is expected to serve as a novel type of data storage medium due to its high storage density [10], high stability, and long lifespan [11], [12]. Due to the inherent limitations of DNA sequencing technologies, errors such as base substitutions, insertions, and deletions are unavoidable [13], [14], making the precise reconstruction of DNA sequences difficult [3]. In addition, assembly often relies on high-quality multiple sequence alignment (MSA). In particular, heuristic-based MSA methods are time-consuming when processing large-scale DNA sequence data and require additional preprocessing [15]. With the advancement of deep learning techniques, researchers have begun to explore deep neural networks for direct reconstruction of DNA sequences from raw data. However, current methods [16] do not consider insertion and deletion errors in DNA sequencing channels, resulting in catastrophic error crosstalk. Especially in cases of high error rates, such as nanopore sequencing, the accuracy of existing methods decreases rapidly.
To address these challenges, this paper proposes a DNA sequence reconstruction model, MACL, based on a multi-scale attention mechanism and contrastive learning, aimed at reconstructing sequences under highly biased sequencing conditions. Such bias primarily arises from variations in sequencing coverage, error patterns, and platform-specific characteristics, which lead to significant imbalance in the number and quality distribution of read copies associated with different reference sequences, thereby posing greater challenges for sequence reconstruction methods. First, the MSA Transformer [17] is introduced to enhance the representation of data along the dimensions of the row and column, capturing the global and local characteristics of DNA sequences, which helps to comprehensively consider features at different scales. For the noise in the sequencing channel, MACL achieves the extraction of homology and difference features among sequences (identifying noise) by calculating the attention between sequences and learns the corresponding weights and reallocations. Furthermore, for the errors between sequences and the substitution errors within sequences, MACL proposes Inter-Sequence and Intra-Sequence Multi-Head Attention Mechanisms, respectively, and handles the insertion and deletion errors through the convolution module. In addition, by integrating contrastive learning with positive attraction and negative repulsion mechanisms and a negative sample construction method that is more suitable for substitution errors, MACL further improves sequence reconstruction accuracy and model robustness.
2. Related work
2.1. DNA storage and genomics assembly
In DNA storage, the following two strategies are commonly employed to ensure read–write consistency of stored data: first, coding rules [18] that satisfy biological constraints are designed to reduce the probability of errors in DNA sequences during data writing [19], [20], [21]; second, DNA sequences are reconstructed from error-containing reads through clustering and multiple sequence alignment during data reading [22], [23]. However, traditional multiple sequence alignment methods based on heuristic algorithms are usually time-consuming when dealing with large-scale raw data, which does not meet the real-time requirements of DNA storage [15]. With the advancement of deep learning, researchers have explored the use of deep neural networks [16], [24] for DNA sequence reconstruction. In genomics, sequence reconstruction is the process of reassembling short DNA fragments (i.e., reads) obtained by sequencing into complete genome sequences using assembly tools [8]. Among them, SPAdes [7], a general-purpose genome assembly tool based on de Bruijn graphs, is widely applied in microbial genomics and metagenomics research due to its high-quality assembly performance and flexibility.
2.2. DNA sequence reconstruction
Traditional Methods: DNA sequence reconstruction is of great significance in the fields of DNA storage and genomics [25], [26], enabling accurate analysis of gene functions, correction of DNA sequencing errors, and preservation of data integrity and consistency. Gopalan et al. [27] improved the bitwise majority alignment reconstruction algorithm, which aligns all reads by determining the most frequent symbol at each position. Building on this, Organick et al. [22] used this sequence reconstruction algorithm to implement a large-scale DNA data storage and retrieval system. Bitwise majority alignment algorithms are advantageous in that they do not rely on fixed coding strategies, but are challenged by high time complexity and dependency on clustering accuracy. In addition, Sabary et al. [28] proposed a sequence reconstruction algorithm based on dynamic programming, which computes all inter-sequence editing operations (insertion, deletion, or substitution) and uses them to correct selected noisy sequences. This process is further optimized by Divider BMA and Hybrid BMA, which enhance the efficiency and accuracy of DNA sequence reconstruction in large-scale clustering environments by dividing clusters into sub-clusters of varying lengths for majority voting and error correction. However, these traditional methods all incur high complexity and latency when dealing with large-scale data, and their performance is mediocre at high error rates.
Using Deep Learning: With the advent of deep learning, several researchers have utilized deep neural networks for sequence reconstruction [29]. Nahum et al. [30] modeled the error correction of the DNA sequence as a self-supervised sequence-to-sequence task [31] and introduced a single-read reconstruction model for DNA storage systems. In contrast, DNAformer [24] is a multi-read reconstruction model based on deep neural networks with a dynamic programming algorithm. Although DNAformer excels in error correction, it struggles to address the impact of noisy sequences in clusters on reconstruction performance. To address noise in clustered sequences, RobuSeqNet [16] introduces a robust multi-read reconstruction neural network, which partially mitigates the impact of noise on reconstruction accuracy. However, on raw data with high error rates (e.g., nanopore sequencing), its reconstruction effectiveness remains limited.
2.3. Negative sample construction and data augmentation
The construction methods of negative samples in contrastive learning vary significantly depending on the specific objective of the task. In tasks related to DNA sequences, they can be classified into negative samples based on sequence augmentation, negative samples based on biological context, and negative samples based on evolutionary homology. Since the purpose of the DNA sequence reconstruction task is to reconstruct the original sequence with high quality, negative samples built through data (sequence) augmentation are the most applicable. In unsupervised or self-supervised training, the most fundamental method for constructing negative samples is to randomly select sequences that are irrelevant to positive samples within the same batch [32], and masking is also a commonly used solution [33].
Recent advances in contrastive DNA sequence learning reveal critical limitations in modeling biological errors. In terms of data augmentation, CLMB [34] pioneers noise injection to enhance feature robustness, but treats augmentation as isolated perturbations, ignoring semantic relationships between original sequences and noisy variants. Building on CLMB, cGen [35] introduces inversions, masking, and frame shifts to diversify structural patterns, but fails to simulate context-sensitive sequencing errors such as substitutions/indels that compromise alignment accuracy. Advancing further, COMEBin [36] establishes multi-view alignment between local fragments and global sequences but assumes error-free inputs, inadvertently propagating artifacts through both perspectives. While CLMB [34] strengthens noise tolerance, cGen [35] enriches positional variation, and COMEBin [36] integrates hierarchical features, all three share a fundamental “error blindness” – they prioritize augmentation diversity over explicit error correction dynamics. This collective limitation stems from modeling synthetic distortions rather than real error propagation mechanisms, particularly problematic for non-stationary sequencing technologies. Current frameworks focus on stabilizing representations across transformations, but lack bidirectional error-repair modeling to disentangle technical artifacts from biological signals. The field urgently needs unified architectures that jointly optimize contrastive learning with probabilistic error correction, enabling adaptive reconstruction of error-prone sequences while preserving biological relevance.
3. Material and methods
3.1. Model overview
As shown in Fig. 1, the input to the proposed MACL consists of clustered DNA sequencing data. First, DNA sequences are encoded using bases as the basic unit, with position encoding added to retain positional information. Next, we efficiently extract base-level and sequence-level features through the MSA Transformer and the Inter-Sequence and Intra-Sequence Multi-Head Attention Mechanisms, and handle insertion and deletion errors through the convolution module. Finally, a contrastive learning strategy is employed to enhance reconstruction performance under high error rate conditions by aligning reconstructed sequences with correct ones and distancing them from erroneous ones in the representation space.
Fig. 1.
DNA sequence reconstruction model architecture.
3.1.1. Sequence embedding
DNA sequences are prone to base substitution, insertion, and deletion errors during DNA sequencing, resulting in raw data of varying length [37]. Due to noise interference in DNA storage channels, and to prevent extreme errors from misleading model training, this work applies a threshold of 10 bases around the reference sequence length during data preprocessing: sequences that are excessively short or long are discarded from the raw data before clustering. This threshold, calibrated to the length distribution of the dataset, mitigates outlier noise and enhances model generalization. For short sequences within the threshold range, MACL pads the sequence with the base “A” on the right side to match the reference sequence length; for longer sequences, MACL truncates the sequence on the right side. This operation standardizes the input sequence length to facilitate efficient model learning. Subsequently, this work aggregates multiple sequencing reads from the same original DNA sequence into input clusters after preprocessing. These clusters enable the model to learn base-level consensus structures across reads, thereby achieving sequence reconstruction and error correction. Finally, DNA sequences are represented using one-hot encoding and padding. This process converts each input sequence into a matrix $X \in \{0,1\}^{L \times 4}$, where $L$ is the predefined maximum sequence length. To preserve base positional information, we introduce a learnable positional encoding method [38]: the model autonomously learns and captures base position information through a randomly initialized position matrix that is dynamically updated during training.
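The preprocessing steps above can be sketched as follows. This is a minimal illustration assuming the ±10-base threshold described in the text; the helper names are ours, not from the authors' released code:

```python
import numpy as np

BASES = "ACGT"
LEN_THRESHOLD = 10  # keep reads within +/- 10 bases of the reference length

def filter_reads(reads, ref_len, threshold=LEN_THRESHOLD):
    """Discard reads whose length deviates too far from the reference length."""
    return [r for r in reads if abs(len(r) - ref_len) <= threshold]

def standardize(read, ref_len):
    """Right-pad short reads with 'A' and right-truncate long reads."""
    if len(read) < ref_len:
        return read + "A" * (ref_len - len(read))
    return read[:ref_len]

def one_hot(read):
    """Encode a DNA string as an (L, 4) binary matrix."""
    m = np.zeros((len(read), 4), dtype=np.float32)
    for i, b in enumerate(read):
        m[i, BASES.index(b)] = 1.0
    return m

reads = ["ACGTACGT", "ACG", "ACGTACGTACGTACGTACGTAC"]
kept = filter_reads(reads, ref_len=8)          # drops the 22-base outlier
cluster = np.stack([one_hot(standardize(r, 8)) for r in kept])
```

The resulting `cluster` tensor of shape (reads, length, 4) is the kind of fixed-size input the model can consume in batches.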
3.1.2. Multi-scale attention mechanism
To cope with errors in DNA sequencing reads, MACL employs a multi-scale attention mechanism to compute attention at different scales. The multi-scale attention mechanism includes the base scale (MSA Transformer) and the inter-sequence and intra-sequence scales (Inter-Sequence Attention and Intra-Sequence Attention).
First, we use an MSA Transformer to represent a set of DNA sequences as a two-dimensional feature matrix. The MSA Transformer is a biological sequence language model that requires DNA sequences to be represented as a Multiple Sequence Alignment (MSA). As shown in Fig. 2, the MSA Transformer fully extracts local and global features of DNA sequences at the base scale by applying the attention mechanism to the row and column directions of the input two-dimensional feature matrix, respectively. Specifically, attention in the column direction focuses on associations between bases on different sequences at the same position, capturing base distribution properties from different sequences in the same cluster at a specific position, while attention in the row direction reinforces contextual associations of bases at different positions within a single sequence. This alternating attention mechanism enables MACL to significantly enhance feature extraction at the base scale.
Fig. 2.
Multi-scale attention mechanism.
However, it is difficult to perform comprehensive feature extraction of DNA sequences on the base scale alone. In addition, the clustering process of DNA sequences is often imperfect and can be interfered with by noisy sequences, thus affecting the accuracy of sequence reconstruction. For errors between sequences, MACL adopts Inter-Sequence Multi-Head Attention Mechanism at the sequence scale, which dynamically assigns attention weights to each sequence in the clustering and prioritizes the sequences with fewer errors and higher sequence integrity. Specifically, MACL reduces the model’s dependence on noisy sequences by calculating the similarity scores between sequences [39], and after completing the attention weighting, it removes the influence of noisy sequences on the reconstruction performance by performing feature summation on the clustering dimension.
Finally, the MACL applies the Intra-Sequence Multi-Head Attention Mechanism to extract correlation information between bases within each sequence. This mechanism captures complex contextual patterns within a sequence by computing global dependencies at the base scale, and is particularly effective at identifying the positions of base substitution errors. Unlike the Row Attention Mechanism in MSA Transformer, the Intra-Sequence Multi-Head Attention Mechanism computes attention after weighting and overlaying the clustered sequences by their contributions, focusing on the sequence tensor with more features, while the Row Attention Mechanism computes attention separately for the different bases of all sequences within each cluster before weighting. Through the complementary collaboration of these two mechanisms, the MACL is expected to extract sequence features more comprehensively, enabling reliable reconstruction of DNA sequences under high error rate conditions.
It should be noted that MACL’s “column attention” does not perform attention calculations on pre-aligned sequences, nor does it assume that positions must be explicitly aligned. Instead, it adaptively learns which positions may correspond across sequences through attention mechanisms applied to unaligned raw sequences. This constitutes implicit alignment learning. MACL’s multi-scale attention combines intra-sequence attention with inter-sequence column attention, enabling the model to simultaneously understand base-level local structures and sequence-level global relationships. In DNA sequence reconstruction tasks, explicit MSA often fails when insertions/deletions or mutations are present. MACL’s continuous attention weights can model the contextual semantics of these variations without requiring explicit alignment. Thus, unlike MSA, MACL shifts the concept of “to align” into a differentiable, context-dependent neural space.
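The alternating row/column attention described above can be sketched in NumPy. This is a toy single-head version with random placeholder weights, intended only to show how the same attention operator is applied along the two axes of a (reads × positions × features) tensor; it is not the MSA Transformer's actual parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the second-to-last axis of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_seqs, seq_len, d = 5, 12, 8          # cluster of 5 reads, 12 bases, 8-dim features
x = rng.normal(size=(n_seqs, seq_len, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Row attention: each read attends over its own positions (intra-read context).
row_out = self_attention(x, wq, wk, wv)                     # (n_seqs, seq_len, d)

# Column attention: transpose so each position attends across the reads in the
# cluster, capturing the base distribution at that position.
col_out = self_attention(x.swapaxes(0, 1), wq, wk, wv).swapaxes(0, 1)
```

Transposing the first two axes is all that distinguishes the two directions, which is why the two mechanisms can share one implementation and alternate cheaply.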
3.1.3. Convolution layer
As shown in Fig. 2, to address insertion and deletion errors in DNA sequences, the MACL proposed in this paper employs multiple convolutional kernels of varying sizes to learn relative positional offsets caused by these errors, allowing dynamic interaction and adaptive adjustment of local features. Specifically, convolutional kernels of varying sizes capture offsets caused by local structural changes in the sequence to enhance the perception of local feature relationships and recover the bases where offsets occur.
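A hedged sketch of the multi-kernel idea: several 1-D convolutions with different window sizes slide over the same feature sequence, and their outputs are stacked so that offsets of different ranges are visible side by side. The kernel sizes (3, 5, 7) and random weights are illustrative assumptions, as the exact configuration is not listed here:

```python
import numpy as np

def conv1d_same(x, kernel):
    """'Same'-padded 1-D convolution over an (L, C) feature sequence, one output channel."""
    k, c = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.array([(xp[i:i + k] * kernel).sum() for i in range(x.shape[0])])

def multi_kernel_features(x, kernel_sizes=(3, 5, 7), seed=0):
    """Stack outputs of differently sized kernels to capture offsets at several ranges."""
    rng = np.random.default_rng(seed)
    outs = [conv1d_same(x, rng.normal(size=(k, x.shape[1])) * 0.1)
            for k in kernel_sizes]
    return np.stack(outs, axis=-1)   # (L, n_kernels)

x = np.random.default_rng(1).normal(size=(20, 4))   # 20 positions, 4 base channels
feats = multi_kernel_features(x)
```

Because every kernel uses 'same' padding, the position axis is preserved, so the stacked features remain aligned with the (possibly shifted) bases they describe.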
3.1.4. LSTM layer
To accurately reconstruct DNA sequences, MACL uses Long Short-Term Memory (LSTM) as the prediction layer. DNA sequences often exhibit complex long-range dependencies, and we leverage the memory units and gating mechanisms of LSTM to capture and retain such dependencies, thereby enhancing the model’s ability to represent DNA sequences. In this paper, we use two LSTM layers as prediction layers, apply the ReLU activation function for nonlinear transformation, and output the estimated probability of each base at every position by reducing the sequence feature dimension to 4 through a linear layer.
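As a rough illustration of this prediction head, the following NumPy sketch runs a single LSTM cell over a feature sequence and projects each hidden state to a 4-way base distribution. The paper's model uses two PyTorch LSTM layers; the single cell and random weights here are simplifying assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell update
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def predict_bases(seq_feats, W, U, b, W_out):
    """Run the LSTM over the sequence and softmax-project each state to 4 base probs."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    probs = []
    for x in seq_feats:
        h, c = lstm_step(x, h, c, W, U, b)
        logits = W_out @ h
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())     # distribution over A, C, G, T
    return np.array(probs)

rng = np.random.default_rng(0)
D, H, L = 8, 16, 10
W = rng.normal(size=(4*H, D)) * 0.1
U = rng.normal(size=(4*H, H)) * 0.1
b = np.zeros(4*H)
W_out = rng.normal(size=(4, H)) * 0.1
probs = predict_bases(rng.normal(size=(L, D)), W, U, b, W_out)
```

Each row of `probs` is the per-position base distribution from which the reconstructed sequence is read off by argmax.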
3.1.5. Contrastive learning
To maximize the consistency of positive sample pairs in the representation space, we introduce contrastive learning [40] into the DNA sequence reconstruction task to more accurately reconstruct the original DNA sequence through positive attraction and negative repulsion. The key to contrastive learning, however, is the construction of high-quality positive and negative sample pairs. Unlike traditional construction methods, the positive sample pairs designed in this paper consist of the reference sequence of each homology cluster and the sequence reconstructed by the model. The negative sample pairs consist of each sequence in the homology cluster subjected to base transitions and transversions, paired with the reconstructed sequence. This negative sample construction method better reflects real DNA sequence error patterns and effectively simulates common sequencing errors, significantly improving the model’s robustness under high error rate conditions.
Since DNA sequencing data typically contain multiple copies of the same sequence, this paper calculates the contrastive loss only within a fixed DNA sequence cluster to avoid interference between different clusters. As shown in Eq. (1), where $z_i$ is the feature representation of the $i$-th sample, $z_i^{+}$ is the positive sample, $z_j^{-}$ is the negative sample, $N$ is the batch size, $M$ is the cluster size, $\tau$ is the temperature coefficient, and $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity between two inputs.

$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big) + \sum_{j=1}^{M} \exp\!\big(\mathrm{sim}(z_i, z_j^{-})/\tau\big)} \tag{1}$$
MACL combines the cross-entropy loss and the contrastive loss as the loss function of the model. The formulas are as follows, where $L$ represents the sequence length, $y_i$ is the one-hot label vector of the base category at the $i$-th position, $\hat{y}_i$ is the probability distribution vector predicted by the model, and $\alpha$ and $\beta$ are the loss weights.

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{L}\sum_{i=1}^{L} y_i \log \hat{y}_i \tag{2}$$

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{CE}} + \beta \mathcal{L}_{\mathrm{con}} \tag{3}$$
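The combined objective can be sketched numerically as follows. The InfoNCE-style form of the contrastive term is an assumption reconstructed from its stated ingredients (cosine similarity, temperature, one positive, cluster negatives), and the weights `alpha`/`beta` are placeholders:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(z, z_pos, z_negs, tau=0.1):
    """InfoNCE-style loss: pull z toward its positive, push it from cluster negatives."""
    pos = np.exp(cos_sim(z, z_pos) / tau)
    neg = sum(np.exp(cos_sim(z, n) / tau) for n in z_negs)
    return -np.log(pos / (pos + neg))

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean per-position cross-entropy over an (L, 4) one-hot label matrix."""
    return float(-(y_onehot * np.log(y_prob + eps)).sum(axis=1).mean())

def total_loss(y_onehot, y_prob, z, z_pos, z_negs, alpha=1.0, beta=0.5):
    """Weighted sum of reconstruction (CE) and contrastive terms."""
    return alpha * cross_entropy(y_onehot, y_prob) + beta * contrastive_loss(z, z_pos, z_negs)

# A well-matched positive and a distant negative give a small contrastive loss;
# swapping them gives a large one.
z = np.array([1.0, 0.0])
loss_good = contrastive_loss(z, np.array([1.0, 0.0]), [np.array([-1.0, 0.0])])
loss_bad = contrastive_loss(z, np.array([-1.0, 0.0]), [np.array([1.0, 0.0])])
```

The asymmetry between `loss_good` and `loss_bad` is exactly the positive-attraction/negative-repulsion gradient the training signal exploits.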
4. Results
In this paper, we trained MACL on both real-world DNA storage datasets and simulated datasets with error rates from 1% to 5%, and then evaluated its performance on real-world DNA storage and genome data. In addition, we encoded grayscale and color medical images using different encoding methods [41], [42], [43]. Sequence reconstruction and image recovery were then performed using MACL, and the reconstruction performance of the model was analyzed under different error rates and encoding methods. Finally, we conducted ablation experiments to evaluate the impact of the MSA Transformer and contrastive learning.
4.1. Experimental setup
4.1.1. Problem definition and baseline
DNA sequence reconstruction is fundamentally a trace reconstruction problem [44]. We consider a set of original DNA sequences $\{x_1, \ldots, x_n\}$, where each sequence $x_i$ belongs to $\Sigma^{L}$ and $\Sigma = \{A, C, G, T\}$ is the DNA base alphabet. Due to errors (base substitutions, insertions, and deletions) inherent in DNA storage channels, multiple error-containing copies are typically obtained for each original sequence $x_i$. Let $C_i = \{\tilde{x}_i^{(1)}, \ldots, \tilde{x}_i^{(m)}\}$ denote these noisy copies of $x_i$.
The objective of DNA sequence reconstruction is to compute a corrected sequence $\hat{x}_i$ from these copies such that the edit distance between $\hat{x}_i$ and the original $x_i$ is minimized. This reconstruction process is formally modeled as a mapping $f: C_i \mapsto \hat{x}_i$, where the function $f$ takes the noisy copies as input and outputs the reconstructed sequence $\hat{x}_i$.
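To make the channel model concrete, here is a small simulator that passes an original sequence through a substitution/insertion/deletion channel to produce a noisy cluster. It is a generic IDS-channel sketch with assumed rates, not the DNA-storalator tool used elsewhere in the paper:

```python
import random

BASES = "ACGT"

def noisy_copy(seq, p_sub=0.03, p_del=0.015, p_ins=0.003, rng=None):
    """Produce one noisy read of `seq` with per-base error probabilities."""
    rng = rng or random.Random()
    out = []
    for b in seq:
        if rng.random() < p_del:
            continue                                        # base dropped
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))                   # spurious base inserted
        if rng.random() < p_sub:
            out.append(rng.choice([x for x in BASES if x != b]))  # substitution
        else:
            out.append(b)
    return "".join(out)

def make_cluster(seq, coverage=10, seed=0):
    """Generate `coverage` noisy copies of the same original sequence."""
    rng = random.Random(seed)
    return [noisy_copy(seq, rng=rng) for _ in range(coverage)]

original = "ACGTACGTACGTACGTACGT"
cluster = make_cluster(original, coverage=10)
```

A reconstruction method is then any function that maps such a `cluster` back to an estimate of `original`, evaluated by edit distance.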
To ensure the fairness and generality of the comparison experiments, four recent state-of-the-art (SOTA) methods in the field of DNA sequence reconstruction are selected for comparison. Like the approach presented in this paper, these methods [16], [24], [28] are independent of the encoding scheme during sequence reconstruction and can directly process a cluster of noisy reads generated from sequencing files, without relying on any dataset-specific encoding method.
1. RobuSeqNet [16], a multi-read reconstruction neural network based on a Transformer encoder–decoder, performs well in reconstructing clusters containing contaminated sequences. However, it cannot reconstruct sequences from nanopore sequencing data or other raw data with high error rates.
2. Divider BMA [28] partitions input sequence clusters by sequence length, uses majority voting for sequence correction in sub-clusters equal to the reference sequence length, and applies majority voting to corresponding traces for sub-clusters shorter or longer than the reference sequence length, simultaneously detecting and correcting insertion and deletion errors.
3. Hybrid BMA [28] combines an iterative algorithm with the Divider BMA algorithm to estimate the error probabilities at different base positions for each sequence in the input cluster and performs sequence correction using majority voting.
4. DNAformer [24], a transformer-based model trained with data generated by a self-developed simulator, can reconstruct accurate sequences from defective replicas.
4.1.2. Datasets and training details
In this paper, real-world DNA storage datasets reported by Erlich et al. [45], Organick et al. [22], Grass et al. [46] and Srinivasavaradhan et al. [47] were used to train the models. As shown in Table 1, these datasets were synthesized by Twist Bioscience, Custom Array, and sequenced by Illumina and ONT. They vary in size and error rate, adequately representing complex and diverse DNA sequence reconstruction scenarios and enabling performance evaluation of MACL in different application settings. Additionally, each dataset contains a sequencing dataset and a corresponding reference sequence set. References represent the number of authentic encoded sequences during the DNA storage encoding phase, while reads denote the number of potentially noisy copies generated from these authentic sequences during the sequencing phase. The value of reads/references actually reflects the difficulty of the DNA sequence reconstruction task. A larger value indicates more potentially noisy copies, implying that the model is facing a more challenging task. For each experiment, the dataset is partitioned at the reference-sequence level rather than at the read level. Specifically, each unique reference sequence (i.e., the original ground-truth DNA template) and all its corresponding sequencing reads are treated as an inseparable unit. The reference sequences are first randomly split into training and test sets with a ratio of 7:3, and all sequencing reads derived from the same reference sequence are assigned exclusively to the same subset. This hierarchical “one-to-many” partitioning strategy ensures that the training and test sets are completely disjoint at the sequence source level. In other words, the model is evaluated on sequencing reads from entirely unseen DNA sequences, thereby preventing data leakage and enabling a fair and reliable evaluation of the reconstruction performance.
Table 1.
Description of the dataset.
| Dataset | Erlich 2017 | Organick 2018 | Grass 2015 | Srinivasavaradhan 2021 |
|---|---|---|---|---|
| Number of references | 72 000 | 596 499 | 4989 | 9984 |
| Number of reads | 13 332 276 | 113 783.1 | 2 949 757 | 269 709 |
| Sequence length | 152 | 110 | 117 | 110 |
| Error rate | 0.32% | 0.43% | 0.99% | 5.90% |
| Synthesis technology | Twist Bioscience | Twist Bioscience | CustomArray | Twist Bioscience |
| Sequencing technology | Illumina MiSeq | Illumina NextSeq | Illumina MiSeq | ONT MinION |
The MACL is constructed and trained using the PyTorch framework, with a single NVIDIA GeForce RTX 3090 GPU for training and testing. During training, we set the batch size to 64 and the number of epochs to 200, saving the model parameters every 10 epochs and selecting the best model based on training loss. The initial learning rate is set to 0.001 and is adaptively adjusted based on model performance. The Adam optimizer is used, with exponential decay rates $\beta_1$ and $\beta_2$ set to 0.9 and 0.98, respectively.
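For reference, the Adam update with the stated decay rates can be sketched as follows. This is a single-parameter NumPy version for illustration; the paper uses PyTorch's built-in optimizer:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.98, eps=1e-8):
    """One Adam update using the exponential decay rates reported in the paper."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 for a few hundred steps; theta drifts toward 0.
theta, m, v = np.array(2.0), 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

The bias-correction terms matter early in training, when `m` and `v` are still close to their zero initialization.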
4.1.3. Evaluation metrics
To evaluate the reconstruction performance of MACL, we use the sequence reconstruction rate [48] and the sequence recovery rate [16] as evaluation metrics, which respectively represent the percentage of correct bases and fully recovered sequences out of the total number of reconstructed sequences. These two metrics comprehensively measure the reconstruction performance of the model both on the base scale and on the sequence scale. The specific formulas are shown below:
$$I(R_i, \hat{R}_i) = \begin{cases} 1, & R_i = \hat{R}_i \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$\mathrm{Reconstruction\ Rate} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{d_E(R_i, \hat{R}_i)}{|R_i|}\right) \times 100\% \tag{5}$$

$$\mathrm{Recovery\ Rate} = \frac{1}{N}\sum_{i=1}^{N} I(R_i, \hat{R}_i) \times 100\% \tag{6}$$

where $d_E(R_i, \hat{R}_i)$ represents the edit distance between the two sequences $R_i$ and $\hat{R}_i$, $|R_i|$ represents the length of the sequence $R_i$, $I(\cdot,\cdot)$ is used to determine whether the base types of the two sequences are identical at each index position, $N$ represents the number of reference sequences, and $\hat{R}_i$, $R_i$ represent the reconstructed sequences and the reference sequences, respectively.
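These metrics can be computed directly from the definitions above (Levenshtein edit distance plus a per-sequence exact-match indicator); the function names in this sketch are ours:

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution/match
    return dp[-1]

def reconstruction_rate(refs, recons):
    """Mean fraction of correct bases (1 - normalized edit distance), in percent."""
    return 100.0 * sum(1 - edit_distance(r, s) / len(r)
                       for r, s in zip(refs, recons)) / len(refs)

def recovery_rate(refs, recons):
    """Fraction of sequences reconstructed exactly, in percent."""
    return 100.0 * sum(r == s for r, s in zip(refs, recons)) / len(refs)

refs = ["ACGT", "AAAA"]
recons = ["ACGT", "AATA"]   # second sequence has one substitution
```

On this toy pair, one wrong base out of eight yields a high base-level score while only half the sequences count as fully recovered, which is why the two metrics are reported together.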
4.2. DNA storage sequence reconstruction
To evaluate the reconstruction performance of MACL in DNA storage, this section compares MACL with four existing methods on four real-world DNA storage datasets with different error rates. In particular, for DNAformer [24], which is constrained by its fixed input length and data input rules, we treated the first 12 bases as a fixed index that does not participate in the reconstruction process and reconstructed only the 140 bases after the index; sequences exceeding the input length were truncated to 140 bases, and sequences shorter than 140 bases were padded by repeating the index from the right side until the target length was reached. When finally calculating the reconstruction and recovery rates, we considered only the non-padded part of the sequence. The quantitative results are summarized in Table 2, which presents the sequence reconstruction and recovery performance of different methods under varying sequencing coverages (30×, 20×, and 10×) on real-world DNA storage datasets. At high sequencing coverage (30×), most methods achieve similarly high reconstruction rates under relatively low-error conditions. However, clear differences are already observed in recovery performance, particularly for datasets characterized by high sequencing error rates. For the highly error-prone nanopore sequencing data at 30× coverage, MACL outperforms RobuSeqNet by approximately 7% in reconstruction rate and achieves more than an order-of-magnitude improvement (12×) in recovery rate, indicating its stronger robustness even when sequencing coverage is sufficient. As the sequencing coverage is reduced to 20× and 10×, the performance disparities among different reconstruction strategies become increasingly evident.
Specifically, methods based on per-bit majority alignment (e.g., Divider BMA and Hybrid BMA) exhibit a sharp degradation in recovery performance under low-coverage and high-error conditions, and in some cases fail to recover valid sequences entirely. In contrast, deep learning-based methods demonstrate substantially better robustness across varying coverages. Among them, MACL consistently achieves the highest reconstruction and recovery rates under all coverage settings. Notably, on highly error-prone datasets (e.g., nanopore sequencing) and at low coverage (10×), MACL maintains stable reconstruction performance, whereas RobuSeqNet and majority-alignment-based methods suffer severe performance collapse. These results demonstrate that MACL is more resilient to both high error rates and reduced sequencing coverage, highlighting the advantage of its multi-scale attention mechanism in extracting sequence features under challenging sequencing conditions.
In addition, we evaluate the reconstruction performance of different sequence reconstruction methods under varying error rates using DNA sequences generated from medical images, including MRI slices, CT images, and fundus images. These images are first encoded into DNA sequences following standard DNA storage encoding schemes to flexibly construct test sequences with diverse structures and lengths. Sequencing noise is then introduced using DNA-storalator to simulate Illumina-specific sequencing errors in a controlled manner, enabling systematic evaluation under different base-level error rates. In this simulation, the proportions of substitution, deletion, and insertion errors follow the empirical statistics of the Illumina platform (substitution : deletion : insertion = 10 : 5 : 1) [22]. As shown in Fig. 3, we further evaluate the reconstruction and recovery performance under different sequencing coverages (30×, 20×, and 10×) with error rates ranging from 1% to 5%. At high sequencing coverage (30×), all methods achieve relatively high reconstruction rates under low error conditions. However, as the base error rate increases, clear differences emerge in recovery performance. In particular, when the error rate reaches 4%, the recovery rates of RobuSeqNet and majority-alignment-based methods drop noticeably, whereas MACL consistently maintains a recovery rate close to 100%, demonstrating strong robustness to sequencing errors. When the sequencing coverage decreases to 20× and 10×, the performance gap between different methods becomes increasingly pronounced. The reconstruction and recovery rates of existing methods degrade rapidly with increasing error rates, especially at low coverage, where insufficient read redundancy limits reliable sequence inference.
In contrast, MACL consistently outperforms all baselines across all error rates and coverage settings, exhibiting significantly higher reconstruction and recovery rates under both moderate and low sequencing coverage. Overall, these results indicate that MACL not only achieves superior performance at high coverage, but also demonstrates strong robustness under low-coverage and high-error conditions, highlighting its effectiveness in practical scenarios where sequencing depth is limited.
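The error-injection procedure used by simulators such as DNA-storalator can be approximated with a simplified sketch. The function below is our illustrative stand-in, not the tool's actual implementation, but it applies the same 10 : 5 : 1 substitution : deletion : insertion split at a configurable per-base error rate:

```python
import random

# Illumina-like error mix: substitution : deletion : insertion = 10 : 5 : 1
ERROR_WEIGHTS = {"sub": 10, "del": 5, "ins": 1}
BASES = "ACGT"

def inject_errors(seq, error_rate, rng):
    """Corrupt a sequence at a given per-base error rate, splitting
    errors 10:5:1 among substitutions, deletions, and insertions."""
    kinds, weights = zip(*ERROR_WEIGHTS.items())
    out = []
    for base in seq:
        if rng.random() < error_rate:
            kind = rng.choices(kinds, weights=weights)[0]
            if kind == "sub":
                out.append(rng.choice([b for b in BASES if b != base]))
            elif kind == "ins":
                out.append(base)
                out.append(rng.choice(BASES))
            # "del": drop the base entirely
        else:
            out.append(base)
    return "".join(out)

noisy = inject_errors("ACGT" * 25, 0.05, random.Random(0))
```

Applying the function repeatedly to one reference strand yields a noisy read cluster of the kind the reconstruction models consume.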
Fig. 3.
Comparing sequence reconstruction and recovery rates under varying sequencing coverages with simulated error rates from 1% to 5%.
This performance advantage arises primarily from the multi-scale attention module design. At high error rates, sequence offsets increase, and simple sequence comparison algorithms or single-scale attention mechanisms struggle to capture complex error patterns effectively. In contrast, MACL’s multi-scale attention module extracts more local correlation information at both the base and sequence scales, accurately identifying error patterns such as substitutions, insertions, and deletions, and effectively locating and correcting errors. In addition, this paper conducts mixed training on sequences with varying error rates to build a reconstruction model that adapts to mixed error rates. As shown in Fig. 4, MACL outperforms RobuSeqNet in sequence reconstruction and recovery rates at all error rates, further confirming its robustness and scalability.
Fig. 4.
Comparing sequence reconstruction and recovery rates for models trained at mixed error rates.
Table 2.
Comparison of sequence reconstruction and recovery rates on a real-world DNA storage dataset under different sequencing coverages.
| Reconstruction Rate (%) at a Sequencing Coverage of 30 | | | | |
| Methods | Erlich | Organick | Grass | Srinivasa |
|---|---|---|---|---|
| Hybrid BMA | 99.99 | 99.99 | 99.11 | 85.79 |
| Divider BMA | 99.99 | 99.99 | 69.97 | 51.02 |
| RobuSeqNet | 99.96 | 99.93 | 98.29 | 91.28 |
| DNAFormer | 99.99 | 99.99 | 99.41 | 90.64 |
| MACL | 99.99 | 99.99 | 99.28 | 98.68 |
| Recovery Rate (%) at a Sequencing Coverage of 30 | | | | |
| Hybrid BMA | 99.97 | 99.98 | 3.08 | 8.64 |
| Divider BMA | 99.97 | 99.98 | 0.00 | 0.00 |
| RobuSeqNet | 94.33 | 96.07 | 66.55 | 5.22 |
| DNAFormer | 99.21 | 99.82 | 80.64 | 45.02 |
| MACL | 99.59 | 99.85 | 82.64 | 70.59 |
| Reconstruction Rate (%) at a Sequencing Coverage of 20 | ||||
| Hybrid BMA | 88.68 | 32.61 | 93.43 | 76.86 |
| Divider BMA | 99.99 | 99.93 | 99.64 | 83.87 |
| RobuSeqNet | 97.30 | 87.66 | 87.66 | 74.51 |
| DNAFormer | 99.81 | 99.99 | 96.88 | 86.65 |
| MACL | 99.99 | 99.99 | 99.28 | 97.38 |
| Recovery Rate (%) at a Sequencing Coverage of 20 | ||||
| Hybrid BMA | 10.6 | 96.02 | 19.23 | 0.47 |
| Divider BMA | 99.96 | 99.98 | 93.63 | 1.72 |
| RobuSeqNet | 99.97 | 6.70 | 0.06 | 0.00 |
| DNAFormer | 99.90 | 99.97 | 84.27 | 24.21 |
| MACL | 99.98 | 99.98 | 84.51 | 42.39 |
| Reconstruction Rate (%) at a Sequencing Coverage of 10 | ||||
| Hybrid BMA | 90.89 | 93.81 | 98.90 | 77.69 |
| Divider BMA | 99.99 | 99.99 | 99.99 | 84.63 |
| RobuSeqNet | 92.26 | 98.55 | 85.99 | 70.18 |
| DNAFormer | 99.89 | 99.99 | 96.81 | 86.63 |
| MACL | 99.93 | 99.99 | 98.57 | 95.57 |
| Recovery Rate (%) at a Sequencing Coverage of 10 | ||||
| Hybrid BMA | 15.34 | 17.80 | 78.82 | 3.18 |
| Divider BMA | 99.92 | 99.92 | 99.92 | 1.69 |
| RobuSeqNet | 99.87 | 99.95 | 0.06 | 0.00 |
| DNAFormer | 99.91 | 99.96 | 83.34 | 13.44 |
| MACL | 95.87 | 99.98 | 70.24 | 19.62 |
4.3. DNA storage image reconstruction
DNA data storage consists of two main components: data writing and data reading [49]. In practical applications, accurate recovery of the original data is the key goal of DNA storage data reading [8]. However, most current research on reading DNA storage data stops at the level of sequence reconstruction and does not restore the original data [15], [50], which deviates from the original purpose of data storage. Therefore, to further validate MACL’s end-to-end reconstruction performance in DNA storage, this paper designs a medical image reconstruction task and employs multiple DNA storage codec methods to encode and decode various medical images. The experiments use common DNA encoding methods, including the 0/1 Mapping Code, Yin-Yang Code [51], and DNA Palette Code [41]. These three methods are representative DNA storage encoding and decoding schemes with different design emphases. Specifically, the Yin-Yang Code and DNA Palette Code explicitly incorporate biological constraints such as GC content balance and homopolymer length limitation, whereas the 0/1 Mapping Code does not impose such constraints and is therefore included as an unconstrained baseline. These encoding schemes are used to encode and decode MRI slices, CT images, and fundus images, respectively, enabling a controlled evaluation of reconstruction performance under both constrained and unconstrained encoding conditions. It should be noted that, owing to limitations of DNA storage decoding algorithms, extremely rare errors at critical positions may cause complete failure or poor quality of image reconstruction; sequence-level reconstruction rates therefore do not correlate perfectly with image restoration quality. Consequently, this paper employs a series of image quality assessment metrics to quantify MACL’s performance in medical image reconstruction.
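As a concrete illustration of the unconstrained baseline, a minimal 0/1 (two-bits-per-base) mapping codec might look as follows; the specific bit-to-base table is a common convention, not necessarily the exact one used in our experiments:

```python
# A common 2-bit mapping: 00->A, 01->C, 10->G, 11->T (illustrative choice)
BIT_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT = {b: k for k, b in BIT_TO_BASE.items()}

def encode_bytes(data: bytes) -> str:
    """Map every 2 bits of the payload to one nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode_bases(seq: str) -> bytes:
    """Invert the mapping: 4 bases back into 1 byte."""
    bits = "".join(BASE_TO_BIT[b] for b in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

payload = b"MRI"
strand = encode_bytes(payload)
assert decode_bases(strand) == payload
```

Note that nothing in this mapping prevents long homopolymers or skewed GC content, which is exactly why it serves as the unconstrained baseline against the Yin-Yang and DNA Palette codes.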
During the writing phase, medical images are encoded into DNA sequences that conform to biological constraints, and sequencing noise is introduced using DNA-storalator [52] to simulate the empirical error characteristics of Illumina sequencing, where substitution, deletion, and insertion errors follow an approximate ratio of 10 : 5 : 1 [22]. We designed two sets of comparison experiments: one incorporating Reed–Solomon (RS) error correction codes and the other without RS codes. In all experiments involving RS codes, identical RS parameters were applied across all methods to ensure fair and consistent comparisons. For all experiments, the same clustering method is applied to the noisy reads prior to decoding, ensuring that all sequence reconstruction models operate on identical clustered inputs. In the reading phase, clustered noisy reads are reconstructed using different sequence reconstruction models, and the reconstructed DNA sequences are then decoded into binary data to recover the original images.
Visual Analysis: To more intuitively evaluate the end-to-end image recovery performance after sequence reconstruction, we conduct a visual analysis of the decoded images. In addition to quantitative PSNR metrics, the reconstructed images are visualized in their original PNG file format, a lossless format widely used in medical imaging. This ensures that the displayed results faithfully reflect the decoding outcomes without introducing additional compression artifacts, enabling a reliable and intuitive comparison of visual quality across methods. Fig. 5 illustrates the impact of the various reconstruction methods on the recovery of a brain MRI slice using 0/1 mapping coding. When only the RS error correction code is used and the error rate exceeds 3%, the image cannot be reconstructed because the errors exceed the RS code’s correction capability. Although the BMA-based sequence reconstruction methods alleviate the issue to some extent, significant pixel loss and image misalignment persist at high error rates. Because MACL adopts a data augmentation strategy tailored to the error types of DNA sequencing channels, the model learns better representations of errors, enabling the original image to be reconstructed perfectly even at high error rates.
Fig. 5.
Comparison of qualitative results of image reconstruction of MRI slices of the brain.
To avoid bias toward a particular image type, we additionally evaluate a large number of grayscale (Fig. 6) and color (Fig. 7) medical images. These figures present the recovery results of chest CT and fundus images, encoded with the Yin-Yang Code and DNA Palette Code respectively, at various error rates. At high error rates, the other methods do not take full advantage of the local correlation between sequence copies and generally exhibit color distortion or misalignment, which degrades the quality of the recovered images. In contrast, the MACL proposed in this paper shows considerable advantages in image recovery, particularly under high-error-rate conditions. MACL captures global sequence features through the multi-scale attention mechanism and improves its adaptability to complex noise environments by using convolutions of various sizes for dynamic interaction and local feature adjustment. The experimental results indicate that MACL can accurately correct errors, achieve high-quality image recovery, and generate clear and realistic images.
Fig. 6.
Comparison of qualitative results of chest CT image reconstruction.
Fig. 7.
Comparison of qualitative results of fundus image reconstruction.
Additionally, the consistency of the results across different encoding methods further demonstrates MACL’s versatility: its performance does not depend on a specific encoding method or data structure, but instead exploits the noise characteristics inherent in sequencing channels. In summary, MACL’s performance in an end-to-end DNA storage system showcases its robustness and superiority under high-error-rate conditions, offering a reliable solution for practical DNA storage applications.
Quantitative analysis: From a quantitative perspective, we further analyze the reconstruction results. The experiments show that MACL outperforms the other methods in reconstruction performance across the various experimental settings. Table 3 shows the quantitative results of image recovery for the different sequence reconstruction methods at a simulated error rate of 5%; the reported values are means over the three coding schemes and the different medical images. For the residual errors that persist after sequence reconstruction, this paper employs Reed–Solomon (RS) error correction codes during the decoding stage. As the results in Table 3 show, when MACL is combined with RS error correction codes, lossless decoding and image restoration are achieved; using MACL alone, the sequence reconstruction failure rate is 0.36%, whereas relying solely on RS codes yields a failure rate as high as 86.77%. These results indicate that MACL plays a crucial role in significantly reducing sequence reconstruction errors and effectively alleviates the burden on the error correction codes. This is primarily due to the design of the multi-scale attention mechanism, which fully utilizes the correlation information between sequence copies at multiple scales to accurately capture the error distribution and locate error sites.
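The MSE and PSNR columns in Table 3 follow the standard definitions; a minimal sketch (flat 8-bit pixel lists for brevity) also shows why PSNR is reported as `inf` for the lossless MACL + RS setting:

```python
import math

def mse(a, b):
    """Mean squared error between two equal-size 8-bit images (flat lists)."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for a lossless match."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / err)

original  = [0, 64, 128, 255]
recovered = [0, 64, 128, 255]
assert psnr(original, recovered) == math.inf  # lossless, as for MACL + RS
```

A single flipped pixel makes the MSE nonzero and the PSNR finite, which is why methods with residual decoding errors report finite PSNR values in Table 3.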
Table 3.
Medical image reconstruction assessment for different sequence reconstruction methods at 5% error rate.
| Methods | SSIM | MS-SSIM | MSE | PSNR | Failure rate |
|---|---|---|---|---|---|
| MACL + RS | 1.000 | 1.000 | 0.000 | inf | 0.36% |
| RobuSeqNet + RS | 0.285 | 0.248 | 75.27 | 6.265 | 2.96% |
| MACL | 0.792 | 0.809 | 33.49 | 19.53 | 0.36% |
| RobuSeqNet | 0.355 | 0.321 | 84.09 | 9.021 | 2.96% |
| Hybrid BMA | 0.241 | 0.206 | 90.91 | 7.339 | 5.12% |
| Divider BMA | 0.229 | 0.191 | 91.200 | 7.2311 | 5.49% |
| RS | 0.150 | 0.105 | 94.303 | 8.6857 | 86.77% |
4.4. DNA sequence reconstruction in genomics
DNA sequence reconstruction is crucial in genomic research, and obtaining high-quality sequences is of great significance in fields such as precision medicine, drug development, and genomic data analysis [53]. However, variations in sequencing depth and sequencing errors inevitably introduce noise and bias into sequencing reads, placing high demands on reconstruction methods [54]. To evaluate the performance of MACL on natural genomic sequences in a supervised learning setting, we constructed a dataset based on the real SARS-CoV-2 reference genome [55]. Since raw sequencing data of natural genomes do not provide labeled reference-read pairs required for quantitative evaluation, Illumina-specific sequencing errors were simulated using DNA-storalator to generate paired noisy reads in a controlled and reproducible manner. This design enables fair comparison of sequence reconstruction methods without imposing DNA storage-specific biological constraints.
As shown in Table 4, in the case of nanopore sequencing (MinION), which has the highest error rate, MACL improves the sequence reconstruction rate by 4.59% to 26.81% compared with the other reconstruction methods, and improves the sequence recovery rate by 18.08% over the suboptimal method. This advantage arises from the global modeling capability of the multi-scale attention mechanism in capturing long-range dependencies, as well as the enhanced robustness of sequence representations achieved through contrastive learning with positive and negative samples. Furthermore, analysis of the edit distance between the reference and reconstructed sequences (Table 4) further confirms that MACL accurately recovers most base positions even when a sequence is not fully reconstructed.
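The edit (Levenshtein) distance reported in Table 4 counts the minimum number of substitutions, insertions, and deletions separating two sequences; a standard dynamic-programming sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and
    deletions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

assert edit_distance("ACGTAC", "ACTAC") == 1  # one deletion
assert edit_distance("ACGT", "AGGA") == 2     # two substitutions
```

A small edit distance between a reference strand and its reconstruction means most base sites were recovered, even when the strand does not count as fully reconstructed.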
Table 4.
Comparison of sequence reconstruction and recovery rates of SARS-CoV-2 viral genome sequences under different sequence reconstruction methods.
| Reconstruction Rate (%) | | | | Edit Dist | |
| Methods | MiSeq | NextSeq | MinION | 1 | 2-5 |
|---|---|---|---|---|---|
| This Paper | 100 | 99.99 | 97.37 | 2 | 0 |
| RobuSeqNet | 100 | 99.97 | 92.78 | 7 | 1 |
| Hybrid BMA | 100 | 99.99 | 85.79 | 11 | 4 |
| Divider BMA | 100 | 99.99 | 70.56 | 11 | 5 |
| DNAFormer | 100 | 99.93 | 90.64 | 8 | 5 |
| Recovery Rate (%) | | | | Edit Dist | |
| This Paper | 100 | 99.63 | 63.10 | 2 | 0 |
| RobuSeqNet | 100 | 98.15 | 5.54 | 7 | 1 |
| Hybrid BMA | 100 | 99.63 | 0.00 | 11 | 4 |
| Divider BMA | 100 | 99.29 | 0.00 | 11 | 5 |
| DNAFormer | 100 | 99.79 | 45.02 | 8 | 5 |
4.5. Ablation study
We present three ablation experiments designed to evaluate the contributions of the MSA Transformer module and the contrastive learning term by removing each individually. Table 5 presents the ablation results for the different model variants on two datasets, where MSA refers to the MSA Transformer module and CL denotes the contrastive loss. The experimental results show that the MSA Transformer is crucial for capturing both global and local sequence features, and that a linear layer alone cannot achieve comparable feature extraction. In addition, the contrastive loss is indispensable for improving the model’s noise robustness and discriminative ability. The MSA Transformer effectively captures complex correlations among sequence copies through its attention mechanism, greatly enhancing feature extraction quality.
Table 5.
Ablation study on Grass and Srinivasavaradhan datasets, showing both reconstruction and recovery rates.
| Model | Grass | Srinivasavaradhan |
|---|---|---|
| -MSA (Reconstruction) | 98.84% | 91.37% |
| (Recovery) | 72.32% | 11.90% |
| -CL (Reconstruction) | 99.10% | 96.55% |
| (Recovery) | 78.03% | 38.19% |
| -MSA-CL (Reconstruction) | 98.78% | 90.86% |
| (Recovery) | 68.92% | 5.08% |
| MACL (Reconstruction) | 99.28% | 98.68% |
| (Recovery) | 82.64% | 70.59% |
The contrastive loss, in turn, enhances the model’s capacity to handle high-error-rate DNA sequences by optimizing the sequence distribution in the representation space, with notable advantages in distinguishing correct from noisy sequences. A comprehensive analysis demonstrates that MACL combines the MSA Transformer’s feature extraction capability with the robust optimization effect of the contrastive loss, outperforming all ablated variants in reconstruction performance. These results validate the necessity and effectiveness of the MSA Transformer and the contrastive loss in MACL, offering theoretical insight and practical guidance for developing high-performance DNA sequence reconstruction models.
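MACL's exact contrastive objective is defined in the method section; as a generic illustration of the family it belongs to, an InfoNCE-style loss for one anchor can be written as below. The similarity values and temperature are illustrative, not MACL's actual hyperparameters:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor: the similarity to
    its positive is contrasted against a set of negatives (all inputs
    are cosine similarities in [-1, 1])."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_sum)

# Loss shrinks as the positive pair becomes more similar than the negatives.
well_separated = info_nce(0.9, [-0.5, -0.4])
poorly_separated = info_nce(0.1, [0.0, 0.2])
assert well_separated < poorly_separated
```

Minimizing this quantity pulls positive sample pairs together and pushes noisy negatives apart in the representation space, which is the effect the ablation attributes to the CL term.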
4.6. Computational efficiency analysis of MACL
To comprehensively evaluate the efficiency and feasibility of MACL in DNA sequence reconstruction tasks, this work systematically analyzes its performance against existing baseline methods across four real-world DNA storage datasets – Erlich, Grass, Organick, and Srinivasavaradhan – in terms of memory consumption and inference speed. It should be noted that the heuristic methods (DividerBMA, HybridBMA) report peak CPU memory usage, while the deep learning methods (DNAformer, MACL) report peak GPU memory usage. Owing to the fundamental differences in hardware and access mechanisms between these two memory types, direct numerical comparison is not meaningful; the figures instead serve as resource consumption references within each method type. Table 6, Table 7, Table 8 and Table 9 present all comparison results. Compared with DNAformer, MACL achieves competitive or even superior reconstruction performance while showing significant resource efficiency advantages across all datasets: it occupies less GPU memory and achieves faster inference, making it more suitable for large-scale deployment. Compared with the heuristic methods, MACL delivers superior reconstruction performance on the ONT dataset, which exhibits higher error rates, indicating greater robustness when processing real, complex, and noisy DNA sequences. HybridBMA remains the most competitive baseline in terms of inference speed. To further quantify the efficiency of MACL, this work also evaluated its computational complexity. The model contains 3.28 million trainable parameters, and MACL’s floating-point operations (FLOPs) were estimated using the thop library. Across multiple independent tests, single-sample FLOPs measurements fluctuated between 0.530 and 0.770 GFLOPs, averaging approximately 0.65 GFLOPs.
Inference speed and memory consumption results on four real DNA storage datasets further demonstrate that MACL exhibits lower computational complexity than DNAformer for DNA sequence reconstruction tasks, making it more suitable for real-time applications.
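How a trainable-parameter figure such as 3.28 M is obtained can be sketched by summing weight and bias shapes layer by layer. The layer specification below is a made-up toy stack, not the actual MACL architecture:

```python
def param_count(layers):
    """Count trainable parameters from (fan_in, fan_out, has_bias) specs
    for dense layers: weights are fan_in * fan_out, biases add fan_out."""
    total = 0
    for fan_in, fan_out, bias in layers:
        total += fan_in * fan_out + (fan_out if bias else 0)
    return total

# Hypothetical stack: base embedding, two projections, 4-way output head.
toy_model = [(4, 256, False), (256, 256, True), (256, 256, True), (256, 4, True)]
print(param_count(toy_model))  # 133636
```

Profilers such as thop perform the analogous bookkeeping automatically over a real module graph, and additionally accumulate the multiply-add counts that yield the GFLOPs figures reported above.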
Table 6.
Performance on Erlich Dataset (30 Sequencing Coverage). Mem: CPU/GPU Memory; Time: Reconstruction Time; Recon: Reconstruction Rate; Recov: Recovery Rate. Neural network methods require GPU while heuristic methods run on CPU.
| Method | Mem (MB) | Time (s) | Recon (%) | Recov (%) |
|---|---|---|---|---|
| HybridBMA | 4.14 | 32.18 | 99.99 | 99.97 |
| DividerBMA | 4.12 | 84.04 | 99.99 | 99.97 |
| RobuSeqNet | 566 | 43.93 | 99.96 | 94.33 |
| DNAformer | 1525.76 | 426.515 | 99.99 | 99.21 |
| MACL | 742.66 | 42.94 | 99.99 | 99.59 |
Table 7.
Performance on Organick Dataset (30 Sequencing Coverage). Mem: CPU/GPU Memory; Time: Reconstruction Time; Recon: Reconstruction Rate; Recov: Recovery Rate. Neural network methods require GPU while heuristic methods run on CPU.
| Method | Mem (MB) | Time (s) | Recon (%) | Recov (%) |
|---|---|---|---|---|
| HybridBMA | 3.77 | 26.38 | 99.99 | 99.98 |
| DividerBMA | 2.04 | 3.62 | 99.99 | 99.98 |
| RobuSeqNet | 365.4 | 2192.66 | 99.93 | 96.07 |
| DNAformer | 1556.48 | 32.58 | 99.99 | 99.82 |
| MACL | 573.72 | 2.88 | 99.99 | 99.85 |
Table 8.
Performance on Grass Dataset (30 Sequencing Coverage). Mem: CPU/GPU Memory; Time: Reconstruction Time; Recon: Reconstruction Rate; Recov: Recovery Rate. Neural network methods require GPU while heuristic methods run on CPU.
| Method | Mem (MB) | Time (s) | Recon (%) | Recov (%) |
|---|---|---|---|---|
| HybridBMA | 3.54 | 94.03 | 99.11 | 3.08 |
| DividerBMA | 3.07 | 360.21 | 69.97 | 0.00 |
| RobuSeqNet | 395.9 | 31.45 | 98.29 | 66.55 |
| DNAformer | 1710.08 | 4233.47 | 99.41 | 80.64 |
| MACL | 542.49 | 198.00 | 99.28 | 82.64 |
Table 9.
Performance on Srinivasavaradhan Dataset (30 Sequencing Coverage). Mem: CPU/GPU Memory; Time: Reconstruction Time; Recon: Reconstruction Rate; Recov: Recovery Rate. Neural network methods require GPU while heuristic methods run on CPU.
| Method | Mem (MB) | Time (s) | Recon (%) | Recov (%) |
|---|---|---|---|---|
| HybridBMA | 3.83 | 141.43 | 85.79 | 8.64 |
| DividerBMA | 2.81 | 6.49 | 51.02 | 0.00 |
| RobuSeqNet | 366.1 | 45.27 | 91.28 | 5.22 |
| DNAformer | 1556.48 | 60.02 | 90.64 | 45.02 |
| MACL | 542.49 | 3.89 | 98.68 | 70.59 |
5. Discussion and conclusion
To tackle the challenge of accurately reconstructing DNA sequencing reads under high-error-rate conditions, this paper introduces MACL, a sequence reconstruction model that integrates the MSA Transformer and contrastive learning. Comparative experiments conducted on multiple real-world DNA storage datasets and under different simulated error rates show that MACL significantly outperforms existing methods in terms of sequence reconstruction rate and recovery rate, and demonstrates particularly strong robustness and applicability under high-error-rate conditions.
Specifically, under low error rate conditions, the performance of MACL is comparable to that of the other methods. When the error rate increases to 5%, its sequence recovery rate significantly outperforms that of deep learning methods such as RobuSeqNet and far exceeds that of traditional sequence reconstruction methods based on bitwise majority alignment. This performance advantage is mainly attributed to the multi-scale attention mechanism in MACL, which effectively captures multiple error patterns (including substitutions, insertions, and deletions), thus enabling accurate reconstruction in complex noise environments.
In addition, MACL exhibits excellent end-to-end data reconstruction capability in a simulated DNA storage system. When combined with the RS error correction code, MACL achieves perfect image recovery. Even without the aid of an error correction code, MACL still achieves remarkable results, indicating that it is suitable for a wide range of DNA encoding methods and has broad potential for practical applications. In summary, MACL provides an effective solution for DNA sequence reconstruction under high-error-rate conditions, with excellent performance and robustness.
The current work primarily relies on the ONT MinION and Illumina sequencing platforms, which are mainstream in the DNA storage field. As DNA storage technology continues to advance, we will prioritize leveraging the ultra-long read capabilities of novel sequencing platforms, which will serve as a key direction for future research. In the future, we will focus on optimizing the model architecture, considering the incorporation of base quality scores as supplementary information for model embedding to enhance the confidence of prediction results. We will also improve computational efficiency, explore its potential in more application scenarios, and integrate it with other bioinformatics tools to further advance the development of DNA storage and genomics.
CRediT authorship contribution statement
Xue Li: Writing – original draft, Validation, Software, Resources, Data curation. Yanfen Zheng: Resources, Formal analysis, Data curation. Qi Shao: Software, Resources, Methodology. Jiadong Wang: Resources, Methodology. Wei Li: Formal analysis, Data curation. Bin Wang: Investigation, Funding acquisition. Shihua Zhou: Funding acquisition, Formal analysis. Ben Cao: Methodology, Funding acquisition, Formal analysis. Pan Zheng: Formal analysis, Data curation.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by 111 Center (No. D23006), the National Natural Science Foundation of China (Nos. 62572088, 62272079, 62502063), the National Foreign Expert Project of China (No. D20240244), Natural Science Foundation of Liaoning Province (Nos. 2024-MS-212, 2024-BS-267), Scientific Research Project of Liaoning Provincial Department of Education (No. LJ222411258005), LiaoNing Revitalization Talent Program (No. XLYC2403039), the Artificial Intelligence Innovation Development Plan Project of Liaoning Province (No. 2023JH26/10300025), Joint Plan of Liaoning Province Science and Technology Plan (Nos. 2024JH2/102600064, 2024-MSLH-009), the Dalian Outstanding Young Science and Technology Talent Support Program (No. 2022RJ08), Dalian Major Projects of Basic Research (No. 2023JJ11CG002), the Dalian Young Science and Technology Star Program (No. 2023RQ056), the Interdisciplinary Project of Dalian University (Nos. DLUXK-2024-YB-001, DLUXK-2025-FX-003, DLUXK-2025-QNLG-003, DLUXK-2024-QN-002).
Footnotes
Peer review under the responsibility of Editorial Board of Synthetic and Systems Biotechnology.
Contributor Information
Bin Wang, Email: wangbin@dlu.edu.cn.
Shihua Zhou, Email: zhoushihua@dlu.edu.cn.
Ben Cao, Email: caoben@ieee.org.
References
- 1.Yang S., Bögels B.W., Wang F., Xu C., Dou H., Mann S., Fan C., de Greef T.F. DNA as a universal chemical substrate for computing and data storage. Nat Rev Chem. 2024;8(3):179–194. doi: 10.1038/s41570-024-00576-4. [DOI] [PubMed] [Google Scholar]
- 2.Cao B., Zhao Y., Xie L., Shao Q., Wang K., Wang B., Zhou S., Zheng P. DBSP: An end-to-end pipeline for DNA storage data reconstruction from DNA sequencing. IEEE Trans Mol Biological Multi-Scale Commun. 2025:157–170. doi: 10.1109/TMBMC.2025.3613268. [DOI] [Google Scholar]
- 3.Heckel R., Mikutis G., Grass R.N. A characterization of the DNA data storage channel. Sci Rep. 2019;9(1):9663. doi: 10.1038/s41598-019-45832-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sini M.A., Yaakobi E. 2019 IEEE international symposium on information theory. IEEE; 2019. Reconstruction of sequences in DNA storage; pp. 290–294. [Google Scholar]
- 5.Xu Q., Zhou Y., Sun Q., Zhao X., Lu Z., Bi K. DNA-CTMF: Reconstruct high quality image from lossy DNA storage via pixel-base codebook and median filter. Synth Syst Biotechnol. 2025;10(3):925–935. doi: 10.1016/j.synbio.2025.04.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Shen P., Zheng Y., Zhang C., Li S., Chen Y., Chen Y., Liu Y., Cai Z. [DNA] storage: The future direction for medical cold data storage. Synth Syst Biotechnol. 2025;10(2):677–695. doi: 10.1016/j.synbio.2025.03.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Prjibelski A., Antipov D., Meleshko D., Lapidus A., Korobeynikov A. Using spades de novo assembler. Curr Protoc Bioinform. 2020;70(1) doi: 10.1002/cpbi.102. [DOI] [PubMed] [Google Scholar]
- 8.Cao B., Zheng Y., Shao Q., Liu Z., Xie L., Zhao Y., Wang B., Zhang Q., Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep. 2024;43(4) doi: 10.1016/j.celrep.2024.113699. [DOI] [PubMed] [Google Scholar]
- 9.Cao B., Xue L., Wang B., He T., Zheng Y., Zhang X., Zhang Q. Achieving handle-level random access in an encrypted DNA archival storage system via frequency dictionary mapping coding. Patterns. 2025;1(1) doi: 10.1016/j.patter.2025.101288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Chu L., Su Y., Yao X., Xu P., Liu W. A review of DNA cryptography. Intell Comput. 2025;4:0106. [Google Scholar]
- 11.Wang S., Mao X., Wang F., Zuo X., Fan C. Data storage using DNA. Adv Mater. 2024;36(6) doi: 10.1002/adma.202307499. [DOI] [PubMed] [Google Scholar]
- 12.Doricchi A., Platnich C.M., Gimpel A., Horn F., Earle M., Lanzavecchia G., Cortajarena A.L., Liz-Marzán L.M., Liu N., Heckel R., et al. Emerging approaches to DNA data storage: challenges and prospects. ACS Nano. 2022;16(11):17552–17571. doi: 10.1021/acsnano.2c06748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Rasool A., Hong J., Hong Z., Li Y., Zou C., Chen H., Qu Q., Wang Y., Jiang Q., Huang X., et al. An effective DNA-based file storage system for practical archiving and retrieval of medical MRI data. Small Methods. 2024;8(10) doi: 10.1002/smtd.202301585. [DOI] [PubMed] [Google Scholar]
- 14.Rasool A. RFS-codec: A novel encoding approach to store image data in DNA. J Artif Intell Bioinform. 2025;1(1):41–50. [Google Scholar]
- 15.Xie R., Zan X., Chu L., Su Y., Xu P., Liu W. Study of the error correction capability of multiple sequence alignment algorithm (mafft) in DNA storage. BMC Bioinformatics. 2023;24(1):111. doi: 10.1186/s12859-023-05237-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Qin Y., Zhu F., Xi B., Song L. Robust multi-read reconstruction from noisy clusters using deep neural network for DNA storage. Comput Struct Biotechnol J. 2024;23:1076–1087. doi: 10.1016/j.csbj.2024.02.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Rao R.M., Liu J., Verkuil R., Meier J., Canny J., Abbeel P., Sercu T., Rives A. International conference on machine learning. PMLR; 2021. MSA transformer; pp. 8844–8856. [Google Scholar]
- 18.Ge Q., Qin R., Liu S., Guo Q., Han C., Chen W. Pragmatic soft-decision data readout of encoded large DNA. Brief Bioinform. 2025;26(2):bbaf102. doi: 10.1093/bib/bbaf102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Welzel M., Schwarz P.M., Löchel H.F., Kabdullayeva T., Clemens S., Becker A., Freisleben B., Heider D. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nat Commun. 2023;14(1):bbaf628. doi: 10.1038/s41467-023-36297-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Press W.H., Hawkins J.A., Jones S.K., Jr., Schaub J.M., Finkelstein I.J. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proc Natl Acad Sci. 2020;117(31):18489–18496. doi: 10.1073/pnas.2004821117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Kim J.-W., Jeong J., Kwak H.-Y., No J.-S. Design of DNA storage coding scheme with LDPC codes and interleaving. IEEE Trans NanoBioscience. 2024;23(3) doi: 10.1109/TNB.2024.3379976. [DOI] [PubMed] [Google Scholar]
- 22.Organick L., Ang S.D., Chen Y.-J., Lopez R., Yekhanin S., Makarychev K., Racz M.Z., Kamath G., Gopalan P., Nguyen B., et al. Random access in large-scale DNA data storage. Nature Biotechnol. 2018;36(3):242–248. doi: 10.1038/nbt.4079. [DOI] [PubMed] [Google Scholar]
- 23.Qu G., Yan Z., Wu H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief Bioinform. 2022;23(5):bbac336. doi: 10.1093/bib/bbac336. [DOI] [PubMed] [Google Scholar]
- 24.Bar-Lev D., Orr I., Sabary O., Etzion T., Yaakobi E. Scalable and robust DNA-based storage via coding theory and deep learning. Nat Mach Intell. 2025:1–11. [Google Scholar]
- 25.Chen W., Qin R., Guo Q., Guo J., Ge Q., Yuan Y. Approaching single-molecule assembly-free readout from medium-length encoded DNA. Nat Commun. 2025;16(1):10059. doi: 10.1038/s41467-025-65004-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Liu Z., Li X., Xie L., Wang B., Zhou S., Cao B., Pan Z., Zhang Q. DVOUG enables robust DNA sequence assembly and reconstruction with a dynamic, variable-order graph. Cell Rep Method. 2025;1 doi: 10.1016/j.crmeth.2025.101243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Gopalan P.S., Yekhanin S., Ang S.D., Jojic N., Racz M., Strauss K., Ceze L. 2018. Trace reconstruction from noisy polynucleotide sequencer reads. US Patent US20180211001A1 (Jul. 26 2018) [Google Scholar]
- 28.Sabary O., Yucovich A., Shapira G., Yaakobi E. Reconstruction algorithms for DNA-storage systems. Sci Rep. 2024;14(1):1951. doi: 10.1038/s41598-024-51730-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Wang J., Wang B., Zhou S., Ben C., Wei L., Pan Z. DNACSE: Enhancing genomic llms with contrastive learning for dna barcode identification. J Chem Inf Model. 2026;64(5):1719–1729. doi: 10.1021/acs.jcim.5c02747. [DOI] [PubMed] [Google Scholar]
- 30. Nahum Y., Ben-Tolila E., Anavy L. Single-read reconstruction for DNA data storage using transformers. 2021. arXiv preprint arXiv:2109.05478.
- 31. Li X., Cao B., Wang J., Meng X., Wang S., Huang Y., Petretto E., Song T. Predicting mutation-disease associations through protein interactions via deep learning. IEEE J Biomed Health Inform. 2025;29(6):4512–4523. doi: 10.1109/JBHI.2025.3541848.
- 32. Yang M., Wang Z., Yan Z., Wang W., Zhu Q., Jin C. DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification. BMC Bioinformatics. 2024;25(1):328. doi: 10.1186/s12859-024-05955-8.
- 33. Ge J., Wang J., Ye Q., Pan L., Kang Y., Shen C., Deng Y., Hsieh C.-Y., Hou T. TRAP: A contrastive learning-enhanced framework for robust TCR-pMHC binding prediction with improved generalizability. Chem Sci. 2025;16:9881–9894. doi: 10.1039/d4sc08141b.
- 34. Zhang P., Jiang Z., Wang Y., Li Y. CLMB: Deep contrastive learning for robust metagenomic binning. In: International conference on research in computational molecular biology. Springer; 2022. pp. 326–348.
- 35. Sokolova K., Chen K.M., Troyanskaya O.G. Contrastive pre-training for sequence based genomics models. 2024. bioRxiv preprint.
- 36. Wang Z., You R., Han H., Liu W., Sun F., Zhu S. Effective binning of metagenomic contigs using contrastive multi-view representation learning. Nat Commun. 2024;15(1):585. doi: 10.1038/s41467-023-44290-z.
- 37. Xu C., Zhao C., Ma B., Liu H. Uncertainties in synthetic DNA-based data storage. Nucleic Acids Res. 2021;49(10):5451–5469. doi: 10.1093/nar/gkab230.
- 38. Takase S., Okazaki N. Positional encoding to control output sequence length. 2019. arXiv preprint arXiv:1904.07418.
- 39. Zhou H., He T., Ong Y.-S., Cong G., Chen Q. Differentiable clustering for graph attention. IEEE Trans Knowl Data Eng. 2024;36(8):3751–3764.
- 40. Wu H., Qu Y., Lin S., Zhou J., Qiao R., Zhang Z., Xie Y., Ma L. Contrastive learning for compact single image dehazing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. pp. 10551–10560.
- 41. Yan Z., Zhang H., Lu B., Han T., Tong X., Yuan Y. DNA palette code for time-series archival data storage. Natl Sci Rev. 2025;12(1):nwae321. doi: 10.1093/nsr/nwae321.
- 42. Loey M., Manogaran G., Khalifa N.E.M. A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images. Neural Comput Appl. 2020:1–13. doi: 10.1007/s00521-020-05437-x.
- 43. Kiefer R., Abid M., Ardali M.R., Steen J., Amjadian E. Automated fundus image standardization using a dynamic global foreground threshold algorithm. In: 2023 8th international conference on image, vision and computing. IEEE; 2023. pp. 460–465.
- 44. Bhardwaj V., Pevzner P.A., Rashtchian C., Safonova Y. Trace reconstruction problems in computational biology. IEEE Trans Inform Theory. 2020;67(6):3295–3314. doi: 10.1109/tit.2020.3030569.
- 45. Erlich Y., Zielinski D. DNA fountain enables a robust and efficient storage architecture. Science. 2017;355(6328):950–954. doi: 10.1126/science.aaj2038.
- 46. Grass R.N., Heckel R., Puddu M., Paunescu D., Stark W.J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew Chem Int Ed. 2015;54(8):2552–2555. doi: 10.1002/anie.201411378.
- 47. Srinivasavaradhan S.R., Gopi S., Pfister H.D., Yekhanin S. Trellis BMA: Coded trace reconstruction on IDS channels for DNA storage. In: 2021 IEEE international symposium on information theory. IEEE; 2021. pp. 2453–2458.
- 48. Wang P., Cao B., Ma T., Wang B., Zhang Q., Zheng P. DUHI: dynamically updated hash index clustering method for DNA storage. Comput Biol Med. 2023;164. doi: 10.1016/j.compbiomed.2023.107244.
- 49. Meiser L.C., Antkowiak P.L., Koch J., Chen W.D., Kohll A.X., Stark W.J., Heckel R., Grass R.N. Reading and writing digital data in DNA. Nat Protoc. 2020;15(1):86–101. doi: 10.1038/s41596-019-0244-5.
- 50. Xie L., Cao B., Wen X., Zheng Y., Wang B., Zhou S., Zheng P. Relume: enhancing DNA storage data reconstruction with flow network and graph partitioning. Methods. 2025;240:101–112. doi: 10.1016/j.ymeth.2025.03.022.
- 51. Ping Z., Chen S., Zhou G., Huang X., Zhu S.J., Zhang H., Lee H.H., Lan Z., Cui J., Chen T., et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat Comput Sci. 2022;2(4):234–242. doi: 10.1038/s43588-022-00231-2.
- 52. Chaykin G., Furman N., Sabary O., Ben-Shabat D., Yaakobi E. DNA-Storalator: end-to-end DNA storage simulator. In: 13th annual non-volatile memories workshop. 2022.
- 53. Mishra P., Maurya R., Avashthi H., Mittal S., Chandra M., Ramteke P.W. Genome assembly and annotation. In: Bioinformatics: Methods and Applications. 2022. pp. 49–66.
- 54. Espinosa E., Bautista R., Larrosa R., Plata O. Advancements in long-read genome sequencing technologies and algorithms. Genomics. 2024;116(3). doi: 10.1016/j.ygeno.2024.110842.
- 55. Delgado S., Somovilla P., Ferrer-Orta C., Martínez-González B., Vázquez-Monteagudo S., Muñoz-Flores J., Soria M.E., García-Crespo C., de Ávila A.I., Durán-Pastor A., et al. Incipient functional SARS-CoV-2 diversification identified through neural network haplotype maps. Proc Natl Acad Sci. 2024;121(10). doi: 10.1073/pnas.2317851121.