An artificial chromosome for data storage

Weigang Chen; Mingzhe Han; Jianting Zhou; Qi Ge; Panpan Wang; Xinchen Zhang; Siyu Zhu; Lifu Song; Yingjin Yuan

doi:10.1093/nsr/nwab028

Natl Sci Rev. 2021 Feb 12;8(5):nwab028. doi: 10.1093/nsr/nwab028

Weigang Chen ¹,²,ᵇ; Mingzhe Han ²,³,ᵇ; Jianting Zhou ²,³,ᵇ; Qi Ge ¹; Panpan Wang ¹; Xinchen Zhang ²,³; Siyu Zhu ²,³; Lifu Song ²,³; Yingjin Yuan ²,³,✉

¹ School of Microelectronics, Tianjin University, Tianjin 300072, China

² Frontier Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, China

³ SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China

✉ Corresponding author. E-mail: yjyuan@tju.edu.cn

ᵇ Equally contributed to this work.

PMCID: PMC8288405 PMID: 34691648

Abstract

DNA digital storage provides an alternative for information storage with high density and long-term stability. Here, we report the de novo design and synthesis of an artificial chromosome that encodes two pictures and a video clip. The encoding paradigm utilizing the superposition of sparsified error correction codewords and pseudo-random sequences tolerates base insertions/deletions and is well suited to error-prone nanopore sequencing for data retrieval. The entire 254 kb sequence was 95.27% occupied by encoded data. The Transformation-Associated Recombination method was used in the construction of this chromosome from DNA fragments and necessary autonomous replication sequences. The stability was demonstrated by transmitting the data-carrying chromosome to the 100th generation. This study demonstrates a data storage method using encoded artificial chromosomes via in vivo assembly for write-once and stable replication for multiple retrievals, similar to a compact disc, with potential in economically massive data distribution.

Keywords: DNA storage, synthetic biology, indel correction, encoded DNA, artificial chromosome

A 254 kb artificial chromosome, storing two pictures and one video clip encoded with low-density parity-check codes and pseudo-random sequences, was constructed and stably transmitted from generation to generation. The digital information residing in the chromosome was rapidly retrieved, tolerating base insertions and deletions in nanopore readout.

INTRODUCTION

Rapid progress in synthetic biology during the last two decades has provided powerful tools for the design and chemical synthesis of genomic DNAs with desired functions [1,2]. Examples include genomic DNAs from Escherichia coli [3], Saccharomyces cerevisiae [4–6] and Mycoplasma mycoides [7]. Recently, a number of studies have demonstrated the possibility of using DNA to store digital information instead of genetic information [8–22]. This prompted us to explore the design and synthesis of a chromosome fully dedicated to information storage.

With the development of high-throughput DNA synthesis and sequencing technologies, large-scale data storage in DNA has become feasible [9–14]. Presently, oligo-based efforts are limited by the non-uniformity of in vitro DNA amplification efficiency [14]. Artificial chromosomes introduced into live cells can self-replicate with high accuracy and at low cost, which could represent a practical route to archival storage. The idea of information storage in live cells through DNA has a long history [23]. The storage of a digital movie in a population of bacteria was recently reported [24]. The lengths of artificial DNAs used in live cells for information storage are summarized in Table S1 and have never exceeded several thousand bases per cell [23–31].

In this study, we design and synthesize a yeast artificial chromosome (YAC) containing 254 886 bp using methods previously reported [4,5,32], allowing us to perform an in-depth evaluation of the stability of large-data-encoded DNA. Two pictures and a video clip were encoded in this chromosome using a superposition coding scheme. The stability of this artificial chromosome during yeast replication was well maintained through serial batch cultivation. In vivo assembly of the encoded artificial chromosome is analogous to burning a CD, which is a write-once action, while stable replications of the chromosome allow CD-like multiple retrievals. We thus proved the feasibility of a data storage paradigm using an artificial chromosome with a specialized encoding system.

RESULTS

A YAC of 254 886 bp, specialized for data storage, of which 95.27% was data payload, was designed and constructed as shown in Figs 1A and S1. Sparsified low-density parity-check (LDPC) codes and pseudo-random sequences were superposed to convert two pictures and a video into DNA sequences (Fig. 1B). This artificial chromosome was assembled from six DNA chunks with four autonomously replicating sequences (ARSs) to stabilize the replication. Successful construction was confirmed by pulsed-field gel electrophoresis (PFGE) (Fig. 1C). A portable MinION sequencer from Oxford Nanopore Technologies (ONT) was employed for rapid retrieval of the encoded data. Although the long reads produced were noisy, the original files could be retrieved reliably (Figs 1D and S2).

Figure 1. — Design and assembly of a data-carrying artificial chromosome. (A) Schematic diagram of the data-carrying chromosome. Four additional ARSs were inserted at specific positions as labeled, and BB is short for building block. (B) The encoding scheme. Superposition coding with LDPC codes (R = 5/6) and pseudo-random sequences converted information sub-chunks (54 000 bits) into the DNA sequence (40 500 bp). The design is detailed in Note S1. (C) The NotI digestion of the artificial chromosome released two bands. Payload bands, 244 kb; backbone band, 10 kb. (D) The workflow of this digital data storage mode.
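The lengths quoted in the caption of Fig. 1B can be checked from the code rate and the two mappings described in the text. The sketch below is illustrative only: the function name and its parameters are not from the original work, and it assumes that the 4-to-5-bit sparsification and the 2-bit-per-base transcoding apply to every codeword bit (the watermark superposition is a bitwise XOR and does not change the length).

```python
from fractions import Fraction


def encoded_length_bp(info_bits, code_rate=Fraction(5, 6),
                      sparsify_in=4, sparsify_out=5, bits_per_base=2):
    """Length in bases of one encoded sub-chunk (hypothetical helper)."""
    # LDPC encoding at rate R expands the information bits to info_bits / R.
    codeword_bits = Fraction(info_bits) / code_rate
    # Sparsification maps every 4 codeword bits to 5 bits.
    sparse_bits = codeword_bits * sparsify_out / sparsify_in
    # Transcoding maps every 2 bits to 1 base.
    bases = sparse_bits / bits_per_base
    if bases.denominator != 1:
        raise ValueError("block size does not divide evenly")
    return int(bases)


print(encoded_length_bp(54_000))  # 40500
```

Step by step, 54 000 information bits become 64 800 codeword bits, then 81 000 sparsified bits, then 40 500 bases, which agrees with the caption.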

Superposition coding for chromosome-based DNA storage

An information coding strategy involving the superposition of sparsified LDPC codes and pseudo-random sequences was chosen for this study (Fig. 2). LDPC codes are efficient block error correction codes, widely used in communications and data storage [33–35]. In our design, both binary and non-binary (NB) LDPC codes were employed (Fig. 2A). Digital files were divided into fixed-length blocks (54 000 bits for the binary LDPC code and 32 256 bits for the NB LDPC code). An interleaving step was introduced following LDPC coding in order to handle possible missing segments. The interleaved LDPC codewords were sparsified by mapping every 4 bits to 5 bits (Fig. 2B) and superposed at the bit level with several carefully chosen pseudo-random sequences (called watermarks) (Fig. 2C). These pseudo-random sequences were used for indel identification and addressing, similar to the function of hidden hints in a jigsaw puzzle. Data DNA sequences were derived by transcoding (Fig. 2D), and integrated with the vector and ARSs to form a full artificial chromosome (Note S1, Fig. S3). A toy example of encoding 20 bits is illustrated in Fig. S4.
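The role of the interleaving step can be illustrated with a small sketch. It is not the interleaver used in this study: the seeded random permutation, the 60-symbol toy codeword and all function names are assumptions made for illustration. The point is only that a contiguous missing segment in the stored sequence is scattered across the codeword after de-interleaving, which is the kind of erasure pattern a block code handles more easily.

```python
import random


def make_permutation(n, seed):
    """Pseudo-random permutation of range(n); the seed is a hypothetical shared key."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def interleave(symbols, perm):
    # Position i of the stored stream carries codeword symbol perm[i].
    return [symbols[p] for p in perm]


def deinterleave(stored, perm):
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = stored[i]
    return out


def longest_erasure_run(symbols):
    longest = run = 0
    for s in symbols:
        run = run + 1 if s is None else 0
        longest = max(longest, run)
    return longest


codeword = list(range(60))                       # toy codeword, symbols labeled by index
perm = make_permutation(len(codeword), seed=7)
stored = interleave(codeword, perm)
received = stored[:20] + [None] * 10 + stored[30:]   # a 10-symbol segment goes missing
restored = deinterleave(received, perm)

# The burst is 10 symbols long in the stored stream; after de-interleaving the
# same 10 erasures are spread over the codeword according to the permutation.
print(longest_erasure_run(received), longest_erasure_run(restored))
```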

Figure 2. — Encoding scheme for chromosome-based DNA storage. (A) Error correction coding. Digital files were divided into bit blocks, denoted as m, which were multiplied by the generator matrix G to give LDPC or NB LDPC codewords. (B) Sparsification of codewords. The codewords were randomly interleaved. The interleaved codewords were then sparsified by converting every 4 bits to 5 bits according to the sparsification table. (C) The sparsified codewords were superposed with a predetermined pseudo-random sequence (called a watermark) by an exclusive-or (XOR) operation. (D) The DNA sequence was obtained by transcoding every 2 bits to 1 base according to the transcoding table.
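Steps (B) to (D) can be sketched on a toy input as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the actual sparsification and transcoding tables are given in Fig. 2 and Note S1 and are not reproduced here, so the tables below are hypothetical stand-ins (the 16 five-bit words of Hamming weight at most 2, and an arbitrary 2-bit-to-base map). The watermark is drawn from a seeded generator, and the LDPC encoding and interleaving steps are omitted.

```python
import random
from itertools import product

# Hypothetical sparsification table: each 4-bit value maps to a 5-bit word of weight <= 2.
SPARSE_WORDS = sorted((w for w in product((0, 1), repeat=5) if sum(w) <= 2),
                      key=lambda w: (sum(w), w))
SPARSIFY = dict(zip(product((0, 1), repeat=4), SPARSE_WORDS))
DESPARSIFY = {word: nibble for nibble, word in SPARSIFY.items()}

# Hypothetical transcoding table: 2 bits per base.
BASES = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}
BITS = {base: pair for pair, base in BASES.items()}


def watermark(n_bits, seed):
    """Pseudo-random sequence known to both encoder and decoder (seed is an assumption)."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n_bits)]


def encode(bits, seed=2021):
    if len(bits) % 8:
        raise ValueError("toy encoder needs a multiple of 8 bits")
    sparse = [b for i in range(0, len(bits), 4) for b in SPARSIFY[tuple(bits[i:i + 4])]]
    mixed = [s ^ w for s, w in zip(sparse, watermark(len(sparse), seed))]  # superposition
    return "".join(BASES[(mixed[i], mixed[i + 1])] for i in range(0, len(mixed), 2))


def decode(dna, seed=2021):
    """Inverse mapping for an error-free read only; no indel handling."""
    mixed = [b for base in dna for b in BITS[base]]
    sparse = [m ^ w for m, w in zip(mixed, watermark(len(mixed), seed))]
    return [b for i in range(0, len(sparse), 5) for b in DESPARSIFY[tuple(sparse[i:i + 5])]]


message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
dna = encode(message)
print(len(dna), decode(dna) == message)  # 10 True
```

Because the sparsified codeword contains few ones, the stored sequence stays close to the known watermark, which is what allows a decoder to locate insertions and deletions against it. The decoder above only inverts the mapping for an error-free read; the indel identification itself is described in the recovery section.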

Using the aforementioned methods, we encoded 37 782 bytes of digital data, comprising two pictures and one video clip, into the artificial chromosome with a length of 254 886 bp including the YAC backbone and additional ARSs (Fig. 1A). The overall logical density (including the YAC backbone) of this artificial chromosome is 1.19 bit/bp, which is similar to that of DNA Fountain (Table S2), as calculated in Note S2.
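The quoted logical density follows directly from the two figures above. The short check below is only a worked calculation; the function name is ours, and the full derivation is in Note S2, which is not reproduced here.

```python
def logical_density(payload_bytes, total_bp):
    """Stored bits per base pair, counting the whole chromosome (hypothetical helper)."""
    return payload_bytes * 8 / total_bp


payload_bytes = 37_782
chromosome_bp = 254_886

print(round(logical_density(payload_bytes, chromosome_bp), 2))  # 1.19

# The payload also equals five binary-LDPC blocks plus one NB-LDPC block,
# using the block sizes given in the text.
print(payload_bytes * 8 == 5 * 54_000 + 32_256)  # True
```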

Rationale for additional ARSs: key for 50% GC-content DNA assembly

The transcoding rule that we used (Fig. 2D) resulted in ∼50% guanine-cytosine (GC) content throughout the chromosome (Fig. 3A), and a GC distribution pattern different from that of genetically encoded YACs (Fig. 3B). Previous studies have reported that added ARSs can raise both the assembly efficiency and the stability of an artificial chromosome [36–38]. Our data-encoded DNA contains only one yeast ARS consensus sequence (ACS, WTTTAYRTTTW) (Fig. 3C). In comparison with previously assembled YACs [3,37,38], the ACS count per kb was 0.004, equal to that of an assembly from Synechococcus elongatus PCC7942. We therefore added four ARSs (Fig. 3C). As a result, the 254 886 bp data-carrying chromosome (pHM059) was assembled from six DNA chunks, each 40 kb in length, with four additional ARSs and a pCC1-Ura YAC backbone using the Transformation-Associated Recombination (TAR) method [39]. The rate of correct assembly was 9.4% (9 of 96 clones). Control experiments with no additional ARSs yielded no correct assemblies.
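Counting ACS matches in a sequence is straightforward once the IUPAC ambiguity codes in WTTTAYRTTTW are expanded (W = A or T, Y = C or T, R = A or G). The sketch below is an assumption-laden illustration rather than the procedure used for Fig. 3C: in particular, scanning both strands and allowing overlapping matches are our choices, and the toy sequence is made up.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}
ACS = "WTTTAYRTTTW"
# Lookahead so that overlapping matches are all reported.
ACS_RE = re.compile("(?=(" + "".join(IUPAC[c] for c in ACS) + "))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_acs(seq):
    """Number of ACS matches on the given strand plus its reverse complement."""
    seq = seq.upper()
    forward = sum(1 for _ in ACS_RE.finditer(seq))
    reverse = sum(1 for _ in ACS_RE.finditer(reverse_complement(seq)))
    return forward + reverse


def acs_per_kb(seq):
    return count_acs(seq) / (len(seq) / 1000)


toy = "GC" * 20 + "ATTTATGTTTA" + "GC" * 20   # one ACS embedded in GC-rich filler
print(count_acs(toy))  # 1

# One ACS in 254 886 bp corresponds to the per-kb figure quoted in the text.
print(round(1 / (254_886 / 1000), 3))  # 0.004
```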

Figure 3. — Rationale for ARSs added to the artificial chromosome. (A) Comparison of the GC contents in different assemblies. The DNA sequence as labeled was fragmented into 300 bp, and the GC content of each fragment was calculated. The number of fragments corresponding to different GC contents was normalized by the total fragment number and plotted. (B) Comparison of the maps of different assemblies showing the distribution of various GC contents. (C) Comparison of ACS counts in different assemblies.
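The windowed GC analysis described for panel (A) can be sketched as follows. The function name, the 5% bin width and the handling of a trailing partial fragment (dropped here) are assumptions; only the 300 bp fragment size and the normalization by fragment number come from the caption. The random test sequence merely stands in for encoded DNA.

```python
import random
from collections import Counter


def gc_histogram(seq, window=300, bin_pct=5):
    """Fraction of non-overlapping fragments falling in each GC-content bin.

    Keys are the lower edge of each bin in percent; a trailing partial
    fragment is ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    n_fragments = len(seq) // window
    if n_fragments == 0:
        raise ValueError("sequence shorter than one window")
    bins = Counter()
    for k in range(n_fragments):
        fragment = seq[k * window:(k + 1) * window]
        gc_count = fragment.count("G") + fragment.count("C")
        gc_pct = 100 * gc_count // window          # integer percent, rounded down
        bins[gc_pct // bin_pct * bin_pct] += 1
    return {edge: count / n_fragments for edge, count in sorted(bins.items())}


# A uniformly random sequence concentrates around 50% GC.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(30_000))
for edge, fraction in gc_histogram(random_seq).items():
    print(f"{edge:3d}-{edge + 5:3d}% GC: {fraction:.2f}")
```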

The storage-specific chromosome can be replicated stably with high fidelity

To test whether the data-carrying chromosome could be stably replicated in yeast, we cultured strain yMH007, which harbors the data-carrying artificial chromosome, and strain yMH104, which harbors an empty YAC backbone as a control, in liquid Synthetic Complete medium without uracil (SC-Ura). The growth rates of yMH007 and yMH104 were comparable, with doubling times of 2.7 ± 0.1 and 2.6 ± 0.1 hours (α = 0.05), respectively (Fig. S5A). Both strains were repeatedly cultured for four generations (OD₆₀₀ from 0.1 to 1.6) in fresh media before being harvested for subsequent experiments. Serial dilutions of cells from various generations were spotted on an SC-Ura agar plate. The results showed that yeast harboring the encoded artificial chromosome grew as robustly as the control at 30°C (Fig. 4A). Next, we counted the colonies on SC-Ura and 5-FOA plates, representing the numbers of cells that had maintained and lost the chromosome, respectively. The ratios were similar to those of the controls, approximating 100% in all tested generations (Figs 4B and S5B). Both strains were also passaged in non-selective SC medium, and the data-carrying chromosome was gradually lost from the population, as expected (Fig. S6).

Figure 4. — Analyses of the growth effect and stability of the data-carrying chromosome. (A) The effect of the artificial chromosome on the growth of the host. Yeast strains yMH007 and yMH104 (control) were serially diluted and spotted on the agar plate for growth. Cells were passaged for different numbers of generations, as indicated, in liquid media before the assay. (B) Stability of the data-carrying chromosome. The same number of cells, passaged for different numbers of generations as indicated, was spread on the SC-Ura and 5-FOA plates. The colonies on both plates were counted and their ratios are presented. The results are representative of three independent biological experiments with the corresponding standard deviations.

A BLAST search of the data-carrying DNA sequence against the National Center for Biotechnology Information (NCBI) nucleotide database revealed no homologous sequences. Interestingly, our preliminary data suggested that transcription from this artificial chromosome was active and that none of the 36 802 peptides detected by the data-independent acquisition (DIA) technique was encoded by this chromosome. The transcriptional and translational profiles of pHM059 and its physiological impacts on the host are under further investigation in our laboratory.

The fidelity of the replication of the artificial chromosome was systematically assessed. Multiplex colony polymerase chain reactions (PCRs) were carried out using 96 clones sampled at 20-generation intervals (20th–100th generation, 12 clones per interval, 48 for the 100th). All bands of the expected sizes were observed on the agarose gel (Fig. S7A). The integrity of the chromosome extracted from 12 clones of the 100th generation was also confirmed by PFGE (Fig. S7B). Furthermore, high-throughput sequencing on the Illumina HiSeq platform detected not a single mutation in the information chunks in any of the 24 tested samples from clones sampled at 20-generation intervals (20th–100th generation, three clones per interval, 12 for the 100th), consistent with the low mutation rate of yeast replication [40]. Taken together, we conclude that the encoded artificial chromosome can be stably transmitted through 100 generations in selective media (SC-Ura), which makes it suitable for reliable information retrieval.

Fast recovery from noisy nanopore readout

The third-generation MinION sequencer was used to retrieve the files from the artificial chromosome because of its speed, portability and long-read capability. Following extraction of the artificial chromosome and library preparation, long raw reads were generated by the MinION sequencer with an R9.4.1 flow cell within 10 min. The recovery process from the initial noisy reads is presented in Figs 5A and S8. The raw error rate was 10.79% (Fig. 5B). A stepwise assembly and polishing process was then performed to gradually lower the error rate. Briefly, traditional read-to-read overlap detection and overlap-layout-consensus (OLC) assembly were first carried out using Minimap and Miniasm, tools commonly used in fast mapping and de novo genome assembly [41]. These tools have no error correction functions and thus the assembled contigs still contained many errors, especially insertions and deletions. Next, we polished the coarsely assembled contigs with the Rapid Consensus (RACON) program [42]. With assembly and polishing, the error rate was reduced by an order of magnitude (Fig. 5C), and interference reads were also excluded (Fig. S9). Data DNA sequences were then located and extracted based on the positions of the ARSs and vector sequences (Fig. S10). Insertions and deletions were identified using modified forward–backward algorithms according to the superposed pseudo-random sequences [43] and then converted into substitution errors or erasures (Fig. 5D, Note S4), which were corrected using LDPC codes in the final step [33–35]. We recovered the original files on a laptop computer (Ubuntu 16, Intel® Core™ i7-8565U, 16 GB RAM) within 40 seconds (Movie S1). The minimal coverage for data recovery was tested with various numbers of reads ranging from 500 to 4000 using a desktop computer (Intel® Core™ i9-9900K CPU @ 3.60 GHz, 128 GB RAM) (Fig. 5E). The results indicated that a minimum of 16.8× coverage, equal to 600 reads, was sufficient for recovery.
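The relation between read number and coverage can be made explicit with a short calculation. This is a sketch under assumptions: it takes coverage to be total sequenced bases divided by the full chromosome length, and the mean read length it prints is only implied by the two quoted figures, not a value reported in the text. The function names are ours.

```python
def coverage(n_reads, mean_read_len_bp, target_len_bp):
    """Average depth: total sequenced bases divided by target length."""
    return n_reads * mean_read_len_bp / target_len_bp


def implied_mean_read_length(depth, n_reads, target_len_bp):
    """Mean read length consistent with a given depth and read count."""
    return depth * target_len_bp / n_reads


CHROMOSOME_BP = 254_886

mean_len = implied_mean_read_length(16.8, 600, CHROMOSOME_BP)
print(round(mean_len))  # 7137 (inferred, assuming depth is taken over the whole chromosome)
print(round(coverage(600, mean_len, CHROMOSOME_BP), 1))  # 16.8
```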

Figure 5. — Error processing and data recovery. (A) Schematic flow of data recovery from nanopore sequencing reads. (B) Error rate distribution of reads from the artificial chromosome. Inset: error rate distribution by error typing. (C) Error rate distributions among six information sub-chunks after RACON polishing. Types of errors are labeled in different colors as indicated. (D) The distributions of substitution error rates among six information sub-chunks after indel identification and correction. (E) Recovery tests with various reads.

DISCUSSION

In summary, we designed and synthesized an artificial chromosome consisting of 254 886 bp, carrying data that can be reliably retrieved from noisy nanopore reads. We also demonstrated the analogy of its data storage mode to CDs, regarding write-once and multiple retrievals. In addition, the information carried by our artificial chromosome can be massively copied at low cost owing to faithful DNA self-replication in live cells. Our results also demonstrated that a portable and efficient nanopore-based reading device has great potential for information retrieval from an artificial chromosome. Currently, however, the write-once process involving in vitro chemical synthesis of DNA and in vivo chromosome assembly is still expensive, and thus continuous reduction in writing cost remains a primary concern in the field of DNA archival storage.

Information storage using artificial chromosomes and using oligo pools is compared in Fig. S11. The advantages of using artificial chromosomes in information storage include less bias (Fig. S12A), a lower error rate (Fig. S12B) and a lower cost per copy, as the data retrieval process from the artificial chromosome is PCR-independent. Nanopore-based reading is also faster, given that the encoding method we developed tolerates errors arising from nanopore sequencing. Our coding strategy is also compatible with Illumina-based sequencing (Fig. S13), which results in lower error rates (Table S4) but is more time-consuming.

It is of great importance to note the balance between coding density and other properties [43]. For example, a robust coding system with more redundancy sacrifices information density but tolerates error-prone faster reading, and a GC-content flexible coding system sacrifices information density but improves the stability of the artificial chromosome.

We envision that multiple artificial chromosomes in live cells, all dedicated to information storage, are practically feasible, given that watermark-aided data retrievals can be performed in parallel. Such parallel readouts from two chromosomes (one being real, and the other being virtual) were simulated in Fig. S14.

METHODS

Design of the artificial chromosome for digital data storage

We designed an artificial chromosome consisting of biological chunks and information chunks dedicated for digital data storage. The biological chunks included a YAC backbone and four additional ARSs to stabilize the artificial chromosome (Figs 1A and S1). Of the six information chunks, binary LDPC codes were used in five for the coding of a video clip and a picture, while NB LDPC codes were used in the remaining one to store a picture (Table S3). In addition to LDPC/NB LDPC coding, other strategies including interleaving, sparsification, superposition with watermarks, and transcoding are detailed in Note S1.

Workflow of digital data storage using an artificial chromosome

The workflow of digital storage with the artificial chromosome is divided into four steps (Figs 1D and S2). First, digital files (two pictures and a video) were mapped into data DNA sequences using the superposition coding method. Second, each data DNA sequence was decomposed into a series of overlapping sub-chunks and outsourced for synthesis. The data DNA sequences were then assembled with additional ARSs and a YAC backbone. Third, the data-carrying chromosome was exponentially copied, extracted and sequenced. Fourth, following the proposed recovery process, digital files were rapidly retrieved using raw reads from the ONT MinION sequencer.

Artificial chromosome stability assays in yeast

Baker's yeast S. cerevisiae strains yMH007 and yMH104 were cultivated in flasks containing SC-Ura liquid medium in a shaker incubator at 30°C and 200 rpm. The overnight cultures were then diluted in fresh SC-Ura liquid medium to an OD₆₀₀ of 0.1, and incubated at 30°C until the OD₆₀₀ reached 1.6. Re-dilution and re-cultivation were carried out repeatedly for the passage of generations. 50 μL of serially diluted cultures were spread on SC-Ura agar plates and on SC agar plates supplemented with 5-FOA. Colonies formed on both plates by cells sampled at four-generation intervals were counted. The rate of chromosome-containing cells was calculated as Num_(SC-Ura) / (Num_(SC-Ura) + Num_(SC+5-FOA)), where Num_(SC-Ura) and Num_(SC+5-FOA) are the colony numbers on the SC-Ura and SC + 5-FOA plates, respectively.
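The two quantities used in this assay can be written out explicitly. The sketch below is illustrative: the function names are ours, the colony counts are made-up numbers rather than data from this study, and the generation count assumes that OD₆₀₀ is proportional to cell number over the range used.

```python
import math


def generations(od_start, od_end):
    """Population doublings between two optical densities (assumes OD scales with cell number)."""
    return math.log2(od_end / od_start)


def retention_rate(n_sc_ura, n_5foa):
    """Fraction of cells that kept the chromosome: Num_SC-Ura / (Num_SC-Ura + Num_SC+5-FOA)."""
    total = n_sc_ura + n_5foa
    if total == 0:
        raise ValueError("no colonies counted")
    return n_sc_ura / total


# One passage from OD600 0.1 to 1.6 is a 16-fold increase, i.e. four doublings.
per_passage = generations(0.1, 1.6)
print(round(per_passage, 2))        # 4.0
print(round(100 / per_passage))     # 25 passages to reach the 100th generation

# Hypothetical colony counts, for illustration only.
print(retention_rate(198, 2))       # 0.99
```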

Yeast strain yMH007 was continuously cultivated to the 100th generation in SC-Ura medium at 30°C. Yeast colony multiplex PCRs were carried out using colonies formed by yeast cells with 20-generation intervals and primer sets 17–24 (Table S6).

DATA AVAILABILITY

The data underlying this article will be shared on reasonable request to the corresponding author.

CODE AVAILABILITY

The code is available from the corresponding author upon reasonable request.

Supplementary Material

nwab028_Supplemental_File

Acknowledgements

We thank Profs Yan Zhang and Matthias Bureik for their helpful discussions and assistance in writing the manuscript. We thank Jian Na for his technical support.

FUNDING

This work was supported by the National Natural Science Foundation of China (21621004).

AUTHOR CONTRIBUTIONS

W.C., J.Z. and Y.Y. conceived the study and designed the experiments; W.C., P.W. and Q.G. designed the large DNA encoding and recovery methods and wrote the software; M.H. and Y.Y. designed the biological part of the large DNA; M.H., J.Z., X.Z., S.Z. and L.S. performed the assembly design and experiments; M.H., J.Z. and W.C. performed nanopore sequencing and analyzed data; W.C., M.H., J.Z. and Y.Y. wrote the paper.

Conflict of interest statement. None declared.

REFERENCES

1. Hughes RA, Ellington AD. Synthetic DNA synthesis and assembly: putting the synthetic in synthetic biology. Cold Spring Harb Perspect Biol 2017; 9: a023812. doi: 10.1101/cshperspect.a023812
2. Benner SA, Sismour AM. Synthetic biology. Nat Rev Genet 2005; 6: 533–43. doi: 10.1038/nrg1637
3. Fredens J, Wang K, de la Torre D et al. Total synthesis of Escherichia coli with a recoded genome. Nature 2019; 569: 514–8. doi: 10.1038/s41586-019-1192-5
4. Wu Y, Li B-Z, Zhao M et al. Bug mapping and fitness testing of chemically synthesized chromosome X. Science 2017; 355: eaaf4706. doi: 10.1126/science.aaf4706
5. Xie Z-X, Li B-Z, Mitchell LA et al. ‘Perfect’ designer chromosome V and behavior of a ring derivative. Science 2017; 355: eaaf4704. doi: 10.1126/science.aaf4704
6. Shen Y, Wang Y, Chen T et al. Deep functional analysis of synII, a 770-kilobase synthetic yeast chromosome. Science 2017; 355: eaaf4791. doi: 10.1126/science.aaf4791
7. Gibson DG, Glass JI, Lartigue C et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 2010; 329: 52–6. doi: 10.1126/science.1190719
8. Ceze L, Nivala J, Strauss K. Molecular digital data storage using DNA. Nat Rev Genet 2019; 20: 456–66. doi: 10.1038/s41576-019-0125-3
9. Church GM, Gao Y, Kosuri S. Next-generation digital information storage in DNA. Science 2012; 337: 1628. doi: 10.1126/science.1226355
10. Goldman N, Bertone P, Chen SY et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013; 494: 77–80. doi: 10.1038/nature11875
11. Erlich Y, Zielinski D. DNA Fountain enables a robust and efficient storage architecture. Science 2017; 355: 950–3. doi: 10.1126/science.aaj2038
12. Yazdi SMHT, Gabrys R, Milenkovic O. Portable and error-free DNA-based data storage. Sci Rep 2017; 7: 5011. doi: 10.1038/s41598-017-05188-1
13. Organick L, Ang SD, Chen YJ et al. Random access in large-scale DNA data storage. Nat Biotechnol 2018; 36: 242–8. doi: 10.1038/nbt.4079
14. Organick L, Chen YJ, Ang SD et al. Probing the physical limits of reliable DNA data retrieval. Nat Commun 2020; 11: 616. doi: 10.1038/s41467-020-14319-8
15. Meiser LC, Antkowiak PL, Koch J et al. Reading and writing digital data in DNA. Nat Protoc 2020; 15: 86–101. doi: 10.1038/s41596-019-0244-5
16. Tabatabaei SK, Wang B, Athreya NBM et al. DNA punch cards for storing data on native DNA sequences via enzymatic nicking. Nat Commun 2020; 11: 1742. doi: 10.1038/s41467-020-15588-z
17. Lopez R, Chen Y-J, Ang SD et al. DNA assembly for nanopore data storage readout. Nat Commun 2019; 10: 2933. doi: 10.1038/s41467-019-10978-4
18. Blawat M, Gaedke K, Huetter I et al. Forward error correction for DNA data storage. Procedia Comput Sci 2016; 80: 1011–22. doi: 10.1016/j.procs.2016.05.398
19. Dong Y, Sun F, Ping Z et al. DNA storage: research landscape and future prospects. Natl Sci Rev 2020; 7: 1092–107. doi: 10.1093/nsr/nwaa007
20. Ping Z, Ma D, Huang X et al. Carbon-based archiving: current progress and future prospects of DNA-based data storage. GigaScience 2019; 8: giz075. doi: 10.1093/gigascience/giz075
21. Chen W, Huang G, Li B et al. DNA information storage for audio and video files (in Chinese). SCIENTIA SINICA Vitae 2020; 50: 81–5. doi: 10.1360/SSV-2019-0211
22. Zhirnov V, Zadegan RM, Sandhu GS et al. Nucleic acid memory. Nat Mater 2016; 15: 366–70. doi: 10.1038/nmat4594
23. Davis J. Microvenus. Art J 1996; 55: 70–4. doi: 10.1080/00043249.1996.10791743
24. Shipman SL, Nivala J, Macklis JD et al. CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 2017; 547: 345–9. doi: 10.1038/nature23017
25. Hao M, Qiao H, Gao Y et al. A mixed culture of bacterial cells enables an economic DNA storage on a large scale. Commun Biol 2020; 3: 416. doi: 10.1038/s42003-020-01141-7
26. Nguyen HH, Park J, Park SJ et al. Long-term stability and integrity of plasmid-based DNA data storage. Polymers 2018; 10: 28. doi: 10.3390/polym10010028
27. Bancroft C, Bowler T, Bloom B et al. Long-term storage of information in DNA. Science 2001; 293: 1763–5. doi: 10.1126/science.293.5536.1763c
28. Wong PC, Wong KK, Foote H. Organic data memory using the DNA approach. Commun ACM 2003; 46: 95–8. doi: 10.1145/602421.602426
29. Ailenberg M, Rotstein OD. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 2009; 47: 747–51. doi: 10.2144/000113218
30. Gustafsson C. For anyone who ever said there's no such thing as a poetic gene. Nature 2009; 458: 703. doi: 10.1038/458703a
31. Yachie N, Sekiyama K, Sugahara J et al. Alignment-based approach for durable data storage into living organisms. Biotechnol Prog 2007; 23: 501–5. doi: 10.1021/bp060261y
32. Lin Q, Jia B, Mitchell LA et al. RADOM, an efficient in vivo method for assembling designed DNA fragments up to 10 kb long in Saccharomyces cerevisiae. ACS Synth Biol 2015; 4: 213–20. doi: 10.1021/sb500241e
33. Gallager R. Low-density parity-check codes. IRE Trans Inf Theory 1962; 8: 21–8. doi: 10.1109/TIT.1962.1057683
34. MacKay DJ, Neal RM. Near Shannon limit performance of low density parity check codes. Electron Lett 1997; 33: 457–8. doi: 10.1049/el:19970362
35. Davey MC, MacKay D. Low-density parity check codes over GF(q). IEEE Commun Lett 1998; 2: 165–7. doi: 10.1109/4234.681360
36. Tagwerker C, Dupont CL, Karas BJ et al. Sequence analysis of a complete 1.66 Mb Prochlorococcus marinus MED4 genome cloned in yeast. Nucleic Acids Res 2012; 40: 10375–83. doi: 10.1093/nar/gks823
37. Noskov VN, Karas BJ, Young L et al. Assembly of large, high G+C bacterial DNA fragments in yeast. ACS Synth Biol 2012; 1: 267–73. doi: 10.1021/sb3000194
38. Karas BJ, Molparia B, Jablanovic J et al. Assembly of eukaryotic algal chromosomes in yeast. J Biol Eng 2013; 7: 30. doi: 10.1186/1754-1611-7-30
39. Kouprina N, Larionov V. Selective isolation of genomic loci from complex genomes by transformation-associated recombination cloning in the yeast Saccharomyces cerevisiae. Nat Protoc 2008; 3: 371–7. doi: 10.1038/nprot.2008.5
40. Zhu YO, Siegal ML, Hall DW et al. Precise estimates of mutation rate and spectrum in yeast. Proc Natl Acad Sci USA 2014; 111: E2310–8. doi: 10.1073/pnas.1323011111
41. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 2016; 32: 2103–10. doi: 10.1093/bioinformatics/btw152
42. Vaser R, Sović I, Nagarajan Net al. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 2017; 27: 737–46. 10.1101/gr.214270.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
43. Davey MC, MacKay DJ.. Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans Inf Theory 2001; 47: 687–98. 10.1109/18.910582 [DOI] [Google Scholar]

Associated Data

Supplementary Materials

nwab028_Supplemental_File (docx, 3.3 MB)

Data Availability Statement

The data underlying this article will be shared on reasonable request to the corresponding author.
