A dual-rule encoding DNA storage system using chaotic mapping to control GC content

Xuncai Zhang; Baonan Qi; Ying Niu

doi:10.1093/bioinformatics/btae113

. 2024 Feb 28;40(3):btae113. doi: 10.1093/bioinformatics/btae113

A dual-rule encoding DNA storage system using chaotic mapping to control GC content

Xuncai Zhang ^1,^✉, Baonan Qi ², Ying Niu ³

Editor: Lenore Cowen

PMCID: PMC10937898 PMID: 38419588

Abstract

Motivation

DNA as a novel storage medium is considered an effective solution to the world’s growing demand for information due to its high density and long-lasting reliability. However, early coding schemes ignored the biologically constrained nature of DNA sequences in pursuit of high density, leading to DNA synthesis and sequencing difficulties. This article proposes a novel DNA storage coding scheme. The system encodes half of the binary data using each of the two GC-content complementary encoding rules to obtain a DNA sequence.

Results

After simulating the encoding of representative document and image file formats, a DNA sequence strictly conforming to biological constraints was obtained, reaching a coding potential of 1.66 bit/nt. In the decoding process, a mechanism to prevent error propagation was introduced. The simulation results demonstrate that by adding Reed-Solomon code, 90% of the data can still be recovered after introducing a 2% error, proving that the proposed DNA storage scheme has high robustness and reliability.

Availability and implementation: The source code for the codec scheme of this paper is available at https://github.com/Mooreniah/DNA-dual-rule-rotary-encoding-storage-system-DRRC.

1 Introduction

With the rapid development of the Internet and the Internet of Things, a vast amount of digital data is continuously generated and accumulated. It is estimated that by 2025, nearly 175 zettabytes (ZB) of data will be generated. Modern storage systems face challenges such as high energy consumption, elevated costs, and inadequate data security. DNA, an ancient and efficient information carrier within living organisms, has garnered the attention of researchers due to its attributes of long-term stability, low energy consumption, high security, and immense storage potential. Compared to conventional information carriers, DNA molecules offer numerous advantages, including exceptionally high storage density. Research has demonstrated that the data storage capacity of 1 g of DNA can reach up to 215 petabytes (PB), ∼22 544 384 000 gigabytes (GB), equivalent to the data storage of 220 000 1TB hard drives (De Silva and Ganegoda 2016). Its distinctive stability enables DNA molecules to be preserved for thousands of years under suitable conditions, such as temperatures below −20°C, humidity ranging from 20% to 40%, protection from light, and vacuum conditions (Allentoft et al. 2012, Bhat 2018).

So far, numerous methods utilizing organic molecules (DNA, RNA, oligopeptides, metabolites) for information storage have been proposed (Anavy et al. 2019, Cafferty et al. 2019, Choi et al. 2019, Kennedy et al. 2019, Koch et al. 2020). The advancement of DNA sequencing technology over the past few decades has made genome sequencing increasingly accessible and widespread. As a result, DNA information storage has emerged as a highly esteemed storage strategy. The traditional DNA storage process typically entails the conversion of binary information from files into DNA base sequences. Subsequently, these encoded sequences are synthesized, assembled, and stored within appropriate mediums to ensure both stability and accessibility. Retrieval and interpretation of the original file information are achieved through the utilization of sequencing technologies.

Although the early codebook ([00, 01, 10, 11] → [A, T, C, G]) significantly increased the encoding capacity of DNA, it introduced specific structural patterns within the nucleotide sequence, leading to particular challenges during synthesis and sequencing processes (Kosuri and Church 2014, Yazdi et al. 2015, Shendure et al. 2017). For instance, excessively long repeated nucleotide sequences (n > 4 nt) can lead to base errors during DNA sequence synthesis or sequencing. Uneven distribution of GC content in DNA sequences, whether too low or too high, can result in uneven distribution during polymerase chain reaction (PCR) amplification, and there is a preference for GC content during sequencing, i.e. regions on the genome with a GC content of around 50% are more likely to be sequenced (Niedringhaus et al. 2011, Ross et al. 2013, Van der Verren et al. 2020). The presence of palindrome subsequences within DNA sequences can lead to secondary structures like hairpins during PCR amplification, ultimately affecting the efficiency of the amplification process. Therefore, the DNA code should adhere to specific rules to reduce systematic errors during the DNA data storage process. In August 2012, Church et al. from Harvard University (Church et al. 2012) achieved DNA storage of HTML files, JPG images, and JavaScript programs with 5.2 MB of data. They first proposed a “bit-base” mapping codebook, where each bit corresponds to one base, which avoids homopolymers of length > 3. However, the data could not be completely recovered due to the limited sequencing depth and the lack of error correction strategies. In contrast, Goldman et al. (2013) added a compression strategy and parity-check codes to increase robustness to errors in the DNA storage system. In 2015, Grass et al. (2015) proposed a new method for coding and decoding “symbol-codon” inner and outer codes. These strategies come at the expense of storage density. Erlich and Zielinski (2017) proposed a novel DNA storage method based on Fountain code (LT code) in 2017 and achieved 2.15 MB data files (including Text, OS, Image, PDF, Movie, and Software) for DNA storage and entirely accurate recovery, which does not require repeated sequencing and significantly reduces redundancy compared to previous coded decoding methods. DNA Fountain introduces screening constraints of low redundancy, homopolymer length, and GC content to improve the information fidelity and achieves a storage density of 1.57 bit/nt using the Luby transform. However, there is a risk of decoding failures when dealing with a particular binary due to the fundamental problem of the Luby transform (Lu et al. 2009). Ping et al. (2022) proposed a yin-yang coding and decoding system (YYC) to encode two binary bits into one base using both yin and yang coding rules. This coding method is robust and highly compatible with synthesis and sequencing techniques, but it has to be filtered from multiple coding schemes. Welzel et al. (2023) proposed a coding scheme that supports variable-sized coding sequences to correct substitutions, insertions, losses, and whole DNA strand errors. DNA-Aeon was reported to have better error correction at similar levels of redundancy and reduced DNA synthesis costs compared to other DNA storage technologies.

Nowadays, there are still some difficulties in the development of coding algorithms, i.e. ensuring coding efficiency while needing to take into account the satisfaction of specific constraints and establishing personalized coding for arbitrary constraints. In this article, we propose a DNA dual rule rotational coding (DRRC), where the two coding rules complement each other, the GC content can be stabilized at 50% by the control of chaotic sequences, and the introduction of rotational coding mode effectively avoids the emergence of long homopolymers and has a low time complexity. Through simulation, it is proven that the scheme presented in this article can store source files of different formats and types with a logical storage density of 1.66 bit/nt and has high robustness to ensure data security.

This article is organized as follows. Section 2 describes the constraints of DNA-based storage coding. Encoding and decoding methods are proposed in Section 3. Simulation results are given in Section 4, and Section 5 concludes the article.

2 Constraints in DNA storage

2.1 GC content constraints

DNA storage requires a stable and homogeneous DNA molecule, which consists of four bases, A, T, C, and G. Guanine and cytosine pair with each other in the form of hydrogen bonds to form a stable pairing relationship, and this base pairing is the basis for the transmission of genetic information and DNA replication. GC content refers to the ratio of the sum of the lengths of guanine and cytosine to the total length of the sequence of DNA. The content constraints must satisfy the range of the chemical assay, typically 40% < GC < 60%. For a sequence S, GC(S) can be expressed as shown in Equation (1), where $|G + C|$ is the sum of the number of G and C in the sequence and $|S|$ is the total length of the sequence (Cao et al. 2022).

\begin{matrix} GC (S) = \frac{|G + C|}{|S|} \end{matrix}

(1)

Ross et al. (2013) conducted a systematic comparison of the GC bias of four sequencing technologies: Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, and Pacific Biosciences. The results consistently showed that good coverage was obtained at the mid-range of the GC content of the sequences and showed a decrease in coverage for both high and low GC content sequences. Ion Torrent semiconductor sequencing showed that up to 30% of genomes with low GC content were not covered at all (Quail et al. 2012), while library preparation failed completely for genomes with high GC content (Bragg et al. 2013). Although Illumina HiSeq platform was slightly superior, it exhibited a coverage bias for sequences with extreme GC content that could not be combined with data from different platforms to correct.

2.2 Run length constraints

The length formed by the recurrence of each base in a sequence is called the run length, and during the synthesis and sequencing of DNA molecules, repeated bases may lead to experimental failure. Particularly in sequencing, as the homopolymer length increases, the insertion error rate and deletion error rate increase for all platforms except PacBio RS where the insertion error rate is consistently high, and the most sensitive platform to homopolymer is the Ion Torrent PGM, which was found to produce no reads for homopolymers longer than 14 nucleotides in length in one study (Quail et al. 2012). The run length constraint is defined as:

\begin{matrix} B_{i} \neq B_{i + 1}, i \in [1, n - 1] \end{matrix}

(2)

where $B_{i}$ belongs to a base in a DNA sequence $(B_{1}, B_{2}, B_{3} \dots B_{n})$ of length n. When the maximum run length is >6, the substitution and deletion error rate increases (Schwartz et al. 2012). This also leads to high error rates in the DNA storage process.

3 DNA dual rule rotary encoding storage system

In nature, the DNA molecule consists of two complementary polymer chains, and the complementary nature of the DNA double-stranded structure enables it to form a stable double-helix structure, which can ensure the accuracy and stability of the genetic information in the transmission process. Inspired by the double-stranded structure, this article proposes a new DNA storage system that uses chaotic mapping to control the GC content of base sequences and builds a dual-rule coding table based on the rotational coding of Goldman et al. (2013) to conform to the constraints in the storage of DNA, map binary sequences to base sequences. This storage program can effectively balance the GC content and avoid the generation of homopolymers, enhance the stability of the DNA sequence, and greatly reduce the error rate during sequencing.

3.1 Principles and features of DRRC

As shown in Fig. 1, the principle of DRRC is to divide the binary stream into groups of five bits each, and the first bit of each group is used to control the selection of coding rules, if the first bit has a value of 0, the remaining four bits are coded with the 0-rule, and if the first bit has a value of 1, the remaining four bits are coded with the 1-rule. The coding rules are shown in Fig. 1. The 0-rule coding table constructed in this article has a GC content of 1/3 for all codewords, and the 1-rule coding table has a GC content of 2/3 for all codewords. To balance the GC content of base sequences obtained from the final coding, the value of the first bit of each group of the original binary stream must be controlled. Therefore, before encoding, the first bit of every five bits is extracted to form a sequence. To ensure an equal count of 0 and 1 in this sequence, it is iteratively exclusive OR with a chaotic sequence until 0 and 1 are balanced. The balanced sequence is replaced back to its original position. The altered binary stream is encoded with double rules. Due to this unique encoding method of DRRC, we can control the GC content of the base sequence flexibly. The first bit of every five bits in the binary is not involved in the encoding, the storage density is significantly increased.

DRRC consists of four steps:

Step 1: The binary data is grouped every five bits, and the first bit of each group is randomized to obtain a binary sequence with the first bit 0/1 balanced.

Step 2: Deriving a dual-rule coding table based on the established constraints and coding mapping the binary sequence derived in step 1.

Step 3: DNA synthesis and storage.

Step 4: DNA sequencing and decoding for error correction.

3.2 Randomization of the first position

Chaotic sequences are acyclic, i.e. there are no repeating patterns in the sequence. Although chaotic sequences exhibit complex and irregular behavior, the values of the sequences are uniquely determined for a given mapping or equation and initial conditions. In this article, a one-dimensional logistic mapping is used as in Equation (3), μ is the control parameter, which is set to a fixed value of 3.9 in this scheme. $X_{n}$ is in the range of (0, 1), and the chaotic sequences obtained from the one-dimensional logic mapping need to be binarized by Equation (4), and converted to the corresponding binary sequences to facilitate the subsequent first position randomization. To get the correct chaotic sequence when decoding, three parameters of the one-dimensional logical mapping need to be stored, which are the control parameter 3.9; the length of the chaotic sequence; and the initial value at the time of successful iteration. Encoding these three parameters according to Blawat’s scheme yields a DNA sequence of about 115 nt, and 200 nt when error-correcting and indexing bits are taken into account (Blawat et al. 2016).

\begin{matrix} X_{n + 1} = μ \times X_{n} \times (1 - X_{n}) \end{matrix}

(3)

\begin{matrix} X_{i}^{'} = \{\begin{matrix} 1, {if X}_{i} \geq 0.5 \\ 0, {if X}_{i} \leq 0.5 \end{matrix} \end{matrix}

(4)

The binary sequence to be encoded is S_comp. It needs to control the ratio of 0 and 1 at the first position to balance the GC content of the subsequent encoding so that it falls within 0.5 ± α, where α is controlled by Content0. Figure 2 shows the process of randomizing the first position. Firstly, S_comp is grouped every five bits, and the first bit of each group is taken out and arranged to form a binary sequence called S_first. Then, a one-dimensional logistic map is used to generate a chaotic sequence that is the same length as S_first, called S_chaotic. S_first and S_chaotic are iteratively exclusive OR until a sequence with a 40%–60% content of 0 or 1 is obtained. Finally, this sequence is replaced with the first position of the original sequence, resulting in the randomized binary sequence S_rand. Figure 6D demonstrates that the number and time of iterations in the first position randomization always remain within acceptable limits.

Figure 2. — Randomization of the first position.

Figure 6. — Error correction capability and coding efficiency analysis. (A) Variation of data recovery rate with the introduction of substitution error rate without adding any error correction measures. (B) Variation of data recovery rate with the introduction of insertion/deletion error rate without any error correction measures. (C) Variation of the data recovery rate with the introduction of mixing error rate after the inclusion of the blocking error propagation mechanism and RS code. (D) Time variation of randomization of the first position sequence from 1 to 50 MB.

The randomization process is not only effective in controlling the GC equilibrium but also improves the security of the original message. Chaotic systems are highly sensitive and unpredictable, and their outputs exhibit some degree of irregularity and complexity similar to random sequences. The random sequence generated by the chaotic system is used as a key to encrypt the plaintext, thus achieving the purpose of encryption (Materassi and Basso 2008).

3.3 Dual-rule DNA coding

This section is about mapping binary sequences to base sequences, grouping binary sequences in 5-bit units, and mapping them to base sequences via a table of dual-rule codes to satisfy the constraints of a maximum run length of 3 and GC balancing.

Assuming that the binary sequence after randomization of the first position is S_rand, for each group of mapping process $\{S_{i 1}, S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}\} \to \{N_{i 1}, N_{i 2}, N_{i 3}\} (1 < i < n, N_{i 1}, N_{i 2}, N_{i 3} \in {A, T, C, G})$ , Equation (6) determines the mapping rules for $S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}$ based on $S_{i 1}$ .

\begin{matrix} S_{rand} = (S_{11}, S_{12}, S_{13}, S_{14}, S_{15}, \dots S_{n 1}, S_{n 2}, S_{n 3}, S_{n 4}, S_{n 5}) \end{matrix}

(5)

\begin{matrix} \{\begin{matrix} \{S_{i 1}, S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}\} \overset{0 - rule}{\to} \{N_{i 1}, N_{i 2}, N_{i 3}\}, if S_{i 1} = 0 \\ \{S_{i 1}, S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}\} \overset{1 - rule}{\to} \{N_{i 1}, N_{i 2}, N_{i 3}\}, if S_{i 1} = 1 \end{matrix} \end{matrix}

(6)

The above equation shows that $S_{i 1}$ is the benchmark for selecting the bi-mapping rule, after the first position randomization, the proportion of 0 in the sequence ( $S_{11}, S_{21}, {\dots S}_{i 1}, \dots S_{(n 1) 1}, S_{n 1}$ ) is $\frac{\sum_{i}^{n} S_{i}}{n} \approx 50 %$ . To ensure that the mapped bases are GC balanced, under the 0-rule, it is necessary that only one of the mapped bases $\{N_{i 1}, N_{i 2}, N_{i 3}\}$ is either G or C base. Under the 1-rule, it is required that only two of the mapped bases $\{N_{i 1}, N_{i 2}, N_{i 3}\}$ are either G or C. Based on the above, the dual-rule coding table is constructed as shown in Fig. 1.

\begin{matrix} N_{i 1} = \{\begin{matrix} A, if (S_{i 2} S_{i 3}) = (00) \\ T, if (S_{i 2} S_{i 3}) = (01) \\ \binom{C, if (S_{i 2} S_{i 3}) = (11)}{G, if (S_{i 2} S_{i 3}) = (10)} \end{matrix} \end{matrix}

(7)

The first bit is not involved in the mapping under the 0-rule and 1-rule coding. Bits 2 and 3 are mapped to position 1 of the base sequence, as shown in Equation (7).

Rotary coding (Goldman et al. 2013) is highly respected as one of the most important techniques widely used in data communication and storage. The method considers GC content and homopolymer length constraints, which largely reduces the errors occurring during data storage. Thus, with the idea of rotational coding, bits 4 and 5 of the binary are mapped to positions 2 and 3 of the base sequence, and according to Fig. 1, the result of its mapping is determined by the value of itself and the previous base together.

Take the binary sequence “01101001001001001110” for example, it is first grouped in units of 5 bits, then the first bit is extracted to form the sequence “0010,” and iteratively exclusive OR with the chaotic sequence generated by the logic mapping, and the 0/1 content of the result of the exclusive OR is calculated until the 0/1 content of the sequence reaches the range of 49%–51%, then the iteration is stopped. After calculating the chaotic sequence after one iteration satisfies the above constraints, at which point the chaotic sequence is “1011” (µ, x₀). The result of the exclusive OR is “1001,” replaced by a bit back to the original position. At this point, the sequence to be encoded is “11101001000001011110.” $S_{11} = 1, S_{21} = 0, S_{31} = 0, S_{41} = 1$ , Then the remaining positions within each chunk are mapped according to the 1-, 0-, 0-, and 1-rules, respectively. The mapping result “CACTCAATCCTG” is obtained, in which the maximum homopolymer length is <3, and the GC content is 50%.

3.4 Decoding algorithms

The decoding algorithm is the inverse of the encoding algorithm, and the key to decoding is the process of segmenting the base sequence (in slices of 3) and counting the GC content within each slice.

Suppose the received base sequence is N_receive = (N₁₁, N₁₂, N₁₃, … N_n₁, N_n₂, N_n₃), N_i₁, N_i₂, N_i₃ ∈ {A, T, C, G}. Firstly, count the sum of the number of G/C $n_{i GC}$ in the ith slice $N_{i 1}, N_{i 2}, N_{i 3}$ , The value of $n_{i GC}$ is used to determine which rule is used to decode the ith slice. As shown in Equation (8), when the number of GCs in the ith slice $n_{i GC} = 1$ , it is decoded with the 0-rule, and when the number of GCs in the ith slice $n_{iGC} = 2$ , it is decoded with the 1-rule.

\begin{matrix} \{\begin{matrix} \{N_{i 1}, N_{i 2}, N_{i 3}\} \overset{0 - rule}{\to} \{S_{i 1}, S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}\}, {if n}_{i GC} = 1 \\ \{N_{i 1}, N_{i 2}, N_{i 3}\} \overset{1 - rule}{\to} \{S_{i 1}, S_{i 2}, S_{i 3}, S_{i 4}, S_{i 5}\}, {if n}_{i GC} = 2 \end{matrix} \end{matrix}

(8)

Another key to decoding is replacing the bit of the first position; after decoding by the 0, 1 inverse rule, the binary sequence is still not the original sequence, first split the binary sequence (in units of 5) and extract the first bit, exclusive OR with the chaotic sequence saved in the first position randomization process (μ, x₀) to get the original first position sequence, and then replace the original first position sequence back to the first bit of each slice to get the correct original binary sequence.

The whole decoding process can be divided into the following two steps:

Step 1: Segment the base sequences obtained from sequencing according to the fixed bit number 3, and select the decoding rules for decoding according to the value of $n_{i GC}$ .

Step 2: Split the binary sequence obtained by step 1 according to the fixed number of bits 5, extract the first bit to form the sequence ${(S}_{11}, S_{21}, \dots S_{n 1})$ , exclusive OR with the chaotic sequence to get $S_{first}$ , and replace each bit of it with $S_{i 1}$ in order.

4 DNA sequence design and simulation analysis

Figure 3 shows the structure of the DNA sequence designed in this scheme, the payload portion is 168 nt, and the left and right sides with a length of 24 nt are primers, which play a role in amplifying the target DNA fragments in the PCR, and also affect the quality of the sequencing. The Reed–Solomon (RS) code is a linear packet code that corrects coding and decoding random errors during the process and is widely used in DNA storage (Wicker and Bhargava 1999). Therefore, this scheme introduces the RS code, which is the green part of the figure with a length of 24 nt. The red part with a length of 15 nt is the index, which is used to identify the DNA sequence position.

To better evaluate DRRC storage solutions. In this article, DRRC is compared with Church e al. (2012), Goldman et al. (2013), Grass et al. (2015), DNA Fountain (Erlich and Zielinski 2019), and YYC (Ping et al. 2022) using Chamaeleo (Ping et al. 2021), a platform developed by the Beijing Genomics institution Research Institute for pooling different DNA storage methods. DNA Fountain and YYC algorithms used their original settings with constraints of 40%–60% GC content and maximum homopolymer length 4. Several coding and decoding tests were performed afterward, and detailed analyses of the DNA sequence characteristics of the different schemes mentioned above, as well as the robustness of the error propagation of the DRRC, were performed.

4.1 DNA sequence characterization

In this section, DRRC is combined with five other classical coding schemes, such as Church, Goldman, Grass, DNA Fountain, and YYC, to encode the PDF file of the English version of Wandering Earth and Pride and Prejudice, with a total of 876 960 bytes of data. In the encoding process, the binary sequence obtained from reading is divided into segments every 280 bits, totaling 25 056 segments. Each binary sequence was encoded independently in each encoding scheme to obtain 25 056 base sequences, and finally, the sequence characteristics of all base sequences were counted. Figure 4A shows the distribution of GC content under different coding schemes. This scheme can control the GC content of DNA sequences at 50% by controlling the 0/1 proportion of the first position sequence (Content0 = 50%), YYC and DNA Fountain always control the GC content at 40%–60% due to their unique screening mechanism, Church’s scheme has a GC content of 34%–66%, Goldman’s 21%–80%, and Grass’s 31%–70%. As shown in Fig. 4B, in terms of homopolymerization, the scheme of this article is comparable to Grass, Church, with most of the maximum homopolymer lengths clustered at 3 nt, and the remainder at 2 nt. DNA Fountain and YYC have a maximum homopolymer length of 3–4 nt. It is worth noting that Goldman’s maximum homopolymer length is 1 nt, in this scheme, if the current base is the same as the previous base, the current base is replaced to achieve the effect of avoiding homopolymerization, but this model increases redundancy at the expense of net information density. In summary, this solution provides coding patterns that are reliable and stable, which facilitates the subsequent DNA synthesis, PCR, and sequencing processes. This feature not only reduces the occurrence of accidental errors to a certain extent but also improves the efficiency of the data reading process (i.e. PCR and sequencing).

Figure 4. — Comparison of the properties of DNA sequences by Church, Goldman, Grass, DNA Fountain code, YYC, and DRRC. (A) Distribution of GC content. (B) Distribution of maximum homopolymers.

4.2 Analysis of error correction capacity

The DNA storage process inevitably introduces errors that pose a serious challenge to the accurate recovery of DNA data. DNA molecules may be subject to base substitution, insertion, and deletion errors during synthesis; at the PCR amplification stage, substitution errors may occur; the probability of substitution error for a single base on the 454 GS Junior sequencing platform is 0.05%, and the probability of insertion/deletion is 0.39%, the probability of substitution error for a single base on the second-generation sequencing platform Illumina MiSeq is 0.24%, and the probability of insertion/deletion is 0.009%; the probability of substitution error for a single base on the third-generation sequencing platform Pacific Biosciences RS is 1.10286%, and the probability of insertion/deletion is 15.56571% (Quail et al. 2012, Bragg et al. 2013, Ross et al. 2013). Therefore, the inclusion of error correction measures in decoding is critical to the integrity of data recovery.

\begin{matrix} R_{recovery} = \frac{N_{sdseq}}{N_{tseq}} \end{matrix}

(9)

The data recovery rate is calculated as shown in Equation (9). $N_{sdseq}$ is the number of fully recovered sequences and $N_{tseq}$ is the overall number of DNA sequences. Without incorporating any error correction measures, a Mona Lisa image of size 97.53 kb was taken as input and encoded with a dual rule to form 2316 base sequences of length 207 nt. Substitution and insertion/deletion errors of 0.1%–10% were randomly introduced in them. As shown in Fig. 6A and B, the data recovery rate is always above 80% with the introduction of 0.1%–10% substitution errors, and decreases rapidly and remains at 50% with the introduction of 0.1%–10% insertion/deletion errors. This is because the decoding of DRRC is done in slices (three bases to a slice), as shown in Fig. 5A, and substitution errors will only affect the decoding of this slice itself and will not affect the subsequent slices. Insertion/deletion errors, on the other hand, change the length of the entire sequence and propagate the error to subsequent slices. For the above three types of errors, although the robustness to system errors can be improved by introducing error correction codes such as RS code or low-density parity-check codes, the system robustness cannot be effectively improved when the error rate exceeds the error correction capability of the error correction codes, and the existing error correction codes are not effective in correcting insertion/deletion errors. Therefore, how to improve the robustness of decoding is crucial. In this article, the process of improving the robustness of decoding is divided into two parts, firstly, the Hamming distance is introduced to prevent error propagation, to eliminate most of the errors. Secondly, the errors that cannot be eliminated partially are corrected by RS code.

Figure 5B shows a schematic diagram of the first level of the error correction process, which aims to stop the error propagation. Firstly, the correct DNA sequence is partitioned into 3-base slices, at this time, the Hamming distance between each slice and the coding table code character is 0. Next, a random error is introduced, and after an insertion/deletion operation, the Hamming distance between each slice and the coding table is re-calculated. Unlike Zan et al. (2022), our slices do not necessarily mutate in their Hamming distance after introducing the insertion/deletion error, which is because our 3-base coding table has only 32 characters. There are a total of 4³ = 64 possibilities for the three bases, so after an error occurs within a slice, there is a certain chance that it will change to a character within the coding table, and if it is within the coding table, then at this point, the slice Hamming distance will still be 0. However, as the error propagates backward, the Hamming distance of the subsequent slices is bound to change. Therefore, in this article, the non-zero Hamming distance is considered the basis for locating the occurrence of the error. The first slice with non-zero Hamming distance is first found and labeled as the first error slice; for insertion errors, the first base within the slice is deleted; for deletion errors, a random base is inserted at the first position of the segment. The purpose of this error correction level is not to recover the slice where the insertion/deletion error occurred but to cut off the effect of the insertion/deletion on subsequent slices.

In summary, the first level of the error correction process is as follows:

Step 1: Split the read sequences into 3-base slices and calculate the Hamming distance between each slice and the code word in the coding table.

Step 2: Firstly, compare the length of this sequence with the standard length of our designed DNA sequence, if it is less than the standard length then assume that deletion error occurs in this sequence, if it is greater than the standard length then assume that insertion error occurs in this sequence. Secondly, find the first non-zero Hamming distance slice and determine it as the first error slice; if an insertion error occurs, the first base of the slice is deleted; if a deletion error occurs, any base is added to the first position of the slice. Finally, the entire sequence is sliced again in units of 3 bases, and the Hamming distance between each slice and the code word of the coding table is recalculated.

Step 3: After step 2, regardless of whether the Hamming distance of the first erroneous slice is restored to 0 or not, this slice is skipped, and the process of step 2 is repeated for the subsequent slices until all slices are traversed.

The RS code added to the DNA sequence was used as the second level of error correction, and to test the effect of the overall error correction scheme, 0.1%–10% mixed errors were randomly introduced into the 2316 base sequences encoded from the Mona Lisa picture with a length of 207 nt. To conform to the error pattern of the Illumina MiSeq platform, the ratio of substitutions to insertions/deletions in the mixed errors was 27:1. As shown in Fig. 6C, when the mixed error rate reaches 2%, more than 90% of the data can still be recovered. Therefore, for the Illumina MiSeq platform, which has the smallest sequencing error rate, our error correction scheme is fully competent.

4.3 Comparison with existing schemes

Table 1 lists the encoding potential, net information density, error correction scheme, GC content and maximum homopolymer length of the other six storage schemes compared to the present scheme. The calculation of encoding potential is given by Equation (10).

\begin{matrix} D_{theory} = \frac{L_{bin}}{N_{DRR C_{oligo}} \times L_{DRR C_{inoligo}} + N_{Blawa t_{oligo}} \times L_{Blawa t_{inoligo}}} \end{matrix}

(10)

Table 1.

Comparison of the DRRC storage system with the previous storage scheme.

	Input data (Mbytes)	Coding potential (bit/nt)	Net information density (bits/nt)	Error correction/detection	GC content (%) of sequence	Maximum homopolymer length (nt)
Church et al.	0.65	1	0.83	NO	2.5–100	3
Goldman et al.	0.75	1.58	0.33	Repetition	22.5–82.5	1
Grass et al.	0.08	1.78	1.14	RS	12.5–100	3
Blawat et al.	22	1.6	0.92	BCH + RS + CRC	19–71	3
Erlich and Zielinski	2.75	1.98	1.57	Fountain	40–60	4
Chen et al.	0.318	1.95	1.89	RS	40–60	4
This work	0.836	1.66	1.352	Hamming distance + RS	50	3

Open in a new tab

$L_{bin}$ represents the length of the original binary sequence, $N_{DRR C_{oligo}}$ and $N_{Blawa t_{oligo}}$ denote the number of oligonucleotide sequences encoded by the present scheme and the Blatwat scheme, respectively, and $L_{DRR C_{inoligo}}$ and $L_{Blawa t_{inoligo}}$ denote the lengths of information-carrying portions of oligonucleotide sequences encoded by the present scheme and the Blatwat scheme, respectively. Taking the example of an input size of 876 960 bytes of original data, using the above formulas, our scheme still has a coding potential of 1.66 bit/nt. The calculation of net information density should also consider the index part and error correction code part in the oligonucleotide sequences. Our scheme’s final net information density is 1.352 bit/nt.

DRRC can control DNA encoding constraints fairly precisely compared to other earlier encoding schemes, thereby reducing the complexity of subsequent synthesis and sequencing and avoiding errors to some extent. Some methods use compression of data as part of the encoding process to increase information density, and we emphasize that comparisons should not be made between methods that use compression and those that do not. Moreover, with the rapid growth of information worldwide, it is necessary to store content not only in text form but also in other formats (images, video, audio), which this scheme can do very well.

5 Conclusion

In this article, we propose a dual-rule DNA storage system DRRC based on chaotic mapping to control the GC content. A potential advantage of DRRC is that the first position of randomization is introduced to the original sequence before encoding, and the correct and complete original information can only be decoded with the corresponding chaotic sequence parameter, which improves the security of storage. The coding scheme in this article also has a good storage density, which can reach a logical storage density of 1.66 bit/nt without compression mechanism, and supports compression methods such as Huffman, arithmetic coding, etc., which significantly increases the storage capacity.

In terms of robustness, the use of mapping table coding is more or less likely to have some impact on error propagation, and DRRC is no exception. Therefore, to cope with this problem, this article introduces the concept of Hamming distance to detect error fragments and adopts the method of inserting or deleting bases to cut off the error propagation of the error fragments to the subsequent fragments. Simulations have proven that the data recovery rate is effectively improved after adding this mechanism of blocking error propagation. In this article, RS code is also incorporated into the DNA sequence design to improve data recovery. The data recovery rate can be further improved in the future by increasing the sequencing depth with multiple sequence comparisons. We believe this new storage architecture is expected to facilitate the early realization of DNA storage applications.

Acknowledgements

The authors would like to thank the other members of the lab for their constructive feedback during the discussion and writing of this manuscript. They want to express gratitude to the anonymous reviewers for their hard work and kind comments, which will further improve our work in the future.

Contributor Information

Xuncai Zhang, College of Electrical Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China.

Baonan Qi, College of Electrical Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China.

Ying Niu, College of Building Environment Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China.

Conflict of interest

None declared.

Funding

This work was supported by the National Natural Science Foundation of China (62102374, 62072417).

Data availability

The source code for the codec scheme of this paper is available at https://github.com/Mooreniah/DNA-dual-rule-rotary-encoding-storage-system-DRRC.

References

Allentoft ME, Collins M, Harker D. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc R Soc B 2012;279:4724–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
Anavy L, Vaknin I, Atar O. et al. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat Biotechnol 2019;37:1229–36. [DOI] [PubMed] [Google Scholar]
Bhat WA. Bridging data-capacity gap in big data storage. Fut Gen Comput Syst 2018;87:538–48. [Google Scholar]
Blawat M, Gaedke K, Hütter I. et al. Forward error correction for DNA data storage. Proc Comput Sci 2016;80:1011–22. [Google Scholar]
Bragg LM, Stone G, Butler MK. et al. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput Biol 2013;9:e1003031. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cafferty BJ, Ten AS, Fink MJ. et al. Storage of information using small organic molecules. ACS Cent Sci 2019;5:911–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cao B, Ii X, Zhang X. et al. Designing uncorrelated address constrain for DNA storage by DMVO algorithm. IEEE/ACM Trans Comput Biol Bioinform 2022;19:866–77. [DOI] [PubMed] [Google Scholar]
Choi Y, Ryu T, Lee AC. et al. High information capacity DNA-based data storage with augmented encoding characters using degenerate bases. Sci Rep 2019;9:6582. [DOI] [PMC free article] [PubMed] [Google Scholar]
Church GM, Gao Y, Kosuri S. et al. Next-generation digital information storage in DNA. Science 2012;337:1628– [DOI] [PubMed] [Google Scholar]
De Silva PY, Ganegoda GU.. New trends of digital data storage in DNA. Biomed Res Int 2016;2016:8072463. [DOI] [PMC free article] [PubMed] [Google Scholar]
Erlich Y, Zielinski D.. DNA fountain enables a robust and efficient storage architecture. Science 2017;355:950–4. [DOI] [PubMed] [Google Scholar]
Goldman N, Bertone P, Chen S. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013;494:77–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grass RN, Heckel R, Puddu M. et al. Robust chemical preservation of digital information on DNA in silica with error‐correcting codes. Angew Chem Int Ed Engl 2015;54:2552–5. [DOI] [PubMed] [Google Scholar]
Kennedy E, Arcadia CE, Geiser J. et al. Encoding information in synthetic metabolomes. PLoS One 2019;14:e0217364. [DOI] [PMC free article] [PubMed] [Google Scholar]
Koch J, Gantenbein S, Masania K. et al. A DNA-of-things storage architecture to create materials with embedded memory. Nat Biotechnol 2020;38:39–43. [DOI] [PubMed] [Google Scholar]
Kosuri S, Church GM.. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 2014;11:499–507. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lu F, Foh CH, Cai J et al. LT codes decoding: design and analysis. In: 2009 IEEE International Symposium on Information Theory, 2492–6. IEEE, June 2009.
Materassi D, Basso M.. Time scaling of chaotic systems: application to secure communications. Int J Bifurc Chaos 2008;18:567–75. [Google Scholar]
Niedringhaus TP, Milanova D, Kerby MB. et al. Landscape of next-generation sequencing technologies. Anal Chem 2011;83:4327–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ping Z, Zhang HL, Chen SH et al. Chamaeleo: an integrated evaluation platform for DNA storage. Synth Biol J 2021;2:412–27. [Google Scholar]
Ping Z, Chen S, Zhou G. et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat Comput Sci 2022;2:234–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
Quail MA, Smith M, Coupland P. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13:341. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ross MG, Russ C, Costello M. et al. Characterizing and measuring bias in sequence data. Genome Biol 2013;14:R51. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schwartz JJ, Lee C, Shendure J. et al. Accurate gene synthesis with tag-directed retrieval of sequence-verified DNA molecules. Nat Methods 2012;9:913–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shendure J, Balasubramanian S, Church GM. et al. DNA sequencing at 40: past, present and future. Nature 2017;550:345–53. [DOI] [PubMed] [Google Scholar]
Van der Verren SE, Van Gerven N, Jonckheere W. et al. A dual-constriction biological nanopore resolves homonucleotide sequences with high fidelity. Nat Biotechnol 2020;38:1415–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
Welzel M, Schwarz PM, Löchel HF. et al. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nat Commun 2023;14:628. [DOI] [PMC free article] [PubMed] [Google Scholar]
ΘWicker SB, Bhargava VK (eds). Reed-Solomon Codes and Their Applications . New York, United States: John Wiley & Sons, 1999. [Google Scholar]
Yazdi SMHT, Yuan Y, Ma J. et al. A rewritable, random-access DNA-based storage system. Sci Rep 2015;5:14138–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zan X, Yao X, Xu P. et al. A hierarchical error correction strategy for text DNA storage. Interdiscip Sci 2022;14:141–50. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The source code for the codec scheme of this paper is available at https://github.com/Mooreniah/DNA-dual-rule-rotary-encoding-storage-system-DRRC.

[btae113-B1] Allentoft ME, Collins M, Harker D. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc R Soc B 2012;279:4724–33. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B2] Anavy L, Vaknin I, Atar O. et al. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat Biotechnol 2019;37:1229–36. [DOI] [PubMed] [Google Scholar]

[btae113-B3] Bhat WA. Bridging data-capacity gap in big data storage. Fut Gen Comput Syst 2018;87:538–48. [Google Scholar]

[btae113-B4] Blawat M, Gaedke K, Hütter I. et al. Forward error correction for DNA data storage. Proc Comput Sci 2016;80:1011–22. [Google Scholar]

[btae113-B5] Bragg LM, Stone G, Butler MK. et al. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput Biol 2013;9:e1003031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B6] Cafferty BJ, Ten AS, Fink MJ. et al. Storage of information using small organic molecules. ACS Cent Sci 2019;5:911–6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B7] Cao B, Ii X, Zhang X. et al. Designing uncorrelated address constrain for DNA storage by DMVO algorithm. IEEE/ACM Trans Comput Biol Bioinform 2022;19:866–77. [DOI] [PubMed] [Google Scholar]

[btae113-B8] Choi Y, Ryu T, Lee AC. et al. High information capacity DNA-based data storage with augmented encoding characters using degenerate bases. Sci Rep 2019;9:6582. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B9] Church GM, Gao Y, Kosuri S. et al. Next-generation digital information storage in DNA. Science 2012;337:1628– [DOI] [PubMed] [Google Scholar]

[btae113-B10] De Silva PY, Ganegoda GU.. New trends of digital data storage in DNA. Biomed Res Int 2016;2016:8072463. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B11] Erlich Y, Zielinski D.. DNA fountain enables a robust and efficient storage architecture. Science 2017;355:950–4. [DOI] [PubMed] [Google Scholar]

[btae113-B12] Goldman N, Bertone P, Chen S. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013;494:77–80. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B13] Grass RN, Heckel R, Puddu M. et al. Robust chemical preservation of digital information on DNA in silica with error‐correcting codes. Angew Chem Int Ed Engl 2015;54:2552–5. [DOI] [PubMed] [Google Scholar]

[btae113-B14] Kennedy E, Arcadia CE, Geiser J. et al. Encoding information in synthetic metabolomes. PLoS One 2019;14:e0217364. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B15] Koch J, Gantenbein S, Masania K. et al. A DNA-of-things storage architecture to create materials with embedded memory. Nat Biotechnol 2020;38:39–43. [DOI] [PubMed] [Google Scholar]

[btae113-B16] Kosuri S, Church GM.. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 2014;11:499–507. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B17] Lu F, Foh CH, Cai J et al. LT codes decoding: design and analysis. In: 2009 IEEE International Symposium on Information Theory, 2492–6. IEEE, June 2009.

[btae113-B18] Materassi D, Basso M.. Time scaling of chaotic systems: application to secure communications. Int J Bifurc Chaos 2008;18:567–75. [Google Scholar]

[btae113-B19] Niedringhaus TP, Milanova D, Kerby MB. et al. Landscape of next-generation sequencing technologies. Anal Chem 2011;83:4327–41. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B20] Ping Z, Zhang HL, Chen SH et al. Chamaeleo: an integrated evaluation platform for DNA storage. Synth Biol J 2021;2:412–27. [Google Scholar]

[btae113-B21] Ping Z, Chen S, Zhou G. et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat Comput Sci 2022;2:234–42. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B22] Quail MA, Smith M, Coupland P. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13:341. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B23] Ross MG, Russ C, Costello M. et al. Characterizing and measuring bias in sequence data. Genome Biol 2013;14:R51. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B24] Schwartz JJ, Lee C, Shendure J. et al. Accurate gene synthesis with tag-directed retrieval of sequence-verified DNA molecules. Nat Methods 2012;9:913–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B25] Shendure J, Balasubramanian S, Church GM. et al. DNA sequencing at 40: past, present and future. Nature 2017;550:345–53. [DOI] [PubMed] [Google Scholar]

[btae113-B26] Van der Verren SE, Van Gerven N, Jonckheere W. et al. A dual-constriction biological nanopore resolves homonucleotide sequences with high fidelity. Nat Biotechnol 2020;38:1415–20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B27] Welzel M, Schwarz PM, Löchel HF. et al. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nat Commun 2023;14:628. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B28] ΘWicker SB, Bhargava VK (eds). Reed-Solomon Codes and Their Applications . New York, United States: John Wiley & Sons, 1999. [Google Scholar]

[btae113-B29] Yazdi SMHT, Yuan Y, Ma J. et al. A rewritable, random-access DNA-based storage system. Sci Rep 2015;5:14138–10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btae113-B30] Zan X, Yao X, Xu P. et al. A hierarchical error correction strategy for text DNA storage. Interdiscip Sci 2022;14:141–50. [DOI] [PubMed] [Google Scholar]

PERMALINK

A dual-rule encoding DNA storage system using chaotic mapping to control GC content

Xuncai Zhang

Baonan Qi

Ying Niu

Roles

Abstract

Motivation

Results

1 Introduction

2 Constraints in DNA storage

2.1 GC content constraints

2.2 Run length constraints

3 DNA dual rule rotary encoding storage system

3.1 Principles and features of DRRC

Figure 1.

3.2 Randomization of the first position

Figure 2.

Figure 6.

3.3 Dual-rule DNA coding

3.4 Decoding algorithms

4 DNA sequence design and simulation analysis

Figure 3.

4.1 DNA sequence characterization

Figure 4.

4.2 Analysis of error correction capacity

Figure 5.

4.3 Comparison with existing schemes

Table 1.

5 Conclusion

Acknowledgements

Contributor Information

Conflict of interest

Funding

Data availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases