F1000Res. 2018 Jan 10;7:39. [Version 1] doi: 10.12688/f1000research.13482.1

Improving communication for interdisciplinary teams working on storage of digital information in DNA

Emily E Hesketh 1, Jossy Sayir 2, Nick Goldman 2,a
PMCID: PMC5883387  PMID: 29707196

Abstract

Close collaboration between specialists from diverse backgrounds and working in different scientific domains is an effective strategy to overcome challenges in areas that interface between biology, chemistry, physics and engineering. Communication in such collaborations can itself be challenging. Even when projects are successfully concluded, resulting publications, which are necessarily multi-authored, have the potential to be disjointed. Few, both in the field and outside, may be able to fully understand the work as a whole. This needs to be addressed to facilitate efficient working, peer review, accessibility and impact to larger audiences. We are an interdisciplinary team working in a nascent scientific area, the repurposing of DNA as a storage medium for digital information. In this note, we highlight some of the difficulties that arise from such collaborations and outline our efforts to improve communication through a glossary and a controlled vocabulary and accessibility via short plain-language summaries. We hope to stimulate early discussion within this emerging field of how our community might improve the description and presentation of our work to facilitate clear communication within and between research groups and increase accessibility to those not familiar with our respective fields, be it molecular biology, computer science, information theory or others that might become relevant in future. To enable an open and inclusive discussion we have created a glossary and controlled vocabulary as a cloud-based shared document and we invite other scientists to critique our suggestions and contribute their own ideas.

Keywords: DNA-storage, digital information storage in DNA, synthetic biology, glossary, communication, controlled vocabulary, short plain-language summaries, interdisciplinary collaboration

Introduction

As we tackle increasingly complex issues throughout science, a breadth of knowledge is often necessary to devise novel solutions — something frequently achieved through interdisciplinary collaborations. The inherent diversity within interdisciplinary teams stimulates knowledge exchange, creativity or even a change in perspective; however, it can be very challenging. We work within an emerging field in synthetic biology, repurposing DNA as a storage medium for digital information. Advancing from early proof-of-principle studies in the high-throughput era 1, 2 (see references therein for historical perspective) towards a more reliable, refined and functional large-scale DNA storage system 3, 4 raises unique challenges that can only be resolved through a broad collaborative effort between biochemical and DNA sequencing specialists, computer and molecular scientists, information theorists and others. This body of research has gained considerable interest both within the research community and with the public, and this has further emphasised the need to address our communication and the presentation of our work.

Interdisciplinary teams make significant advances in life sciences

Intersection between these fields is clearly beneficial. Information theory has already underpinned many advances in life sciences, from adapting Levenshtein coding to create error-correcting molecular barcodes used in multiplexed DNA sequencing 5 to Burrows-Wheeler transformation of reference genomes implemented in several short read aligners 6–8. A molecular biologist may see the process of storing information in DNA as a very physical process, progressing from DNA synthesis (writing) to amplification (copying) to sequencing (reading). To an information theorist, this is a noisy channel: a series of transformations through which information is transmitted and the outputs observed. Differences in the way experts in these different fields describe their data and results can hinder collaboration and restrict impact. As a result, publications have the potential to be an ineffective hybrid of accepted nomenclature and data presentation within the intersecting fields with few readers, both in the team and outside, able to fully understand the publication as a whole.

Unambiguous communication can be challenging and misunderstandings can pass unnoticed

Unsurprisingly, terms shared between the intersecting disciplines can have disparate meanings. Use of the word ‘qubit’ can lead you to believe that some DNA needs quantifying 9 or you may be discussing quantum information or quantum field theory 10. This complicates communication; misunderstandings have the potential to pass unnoticed, only becoming apparent downstream. Examples of such misunderstandings are the use of the words errors, erasures, and substitutions when retrieving data through DNA sequencing. To an information theorist, an ‘error’ refers to a falsely read symbol, for example when an A in the DNA sequence is falsely read as a C, distinct from an insertion or deletion. An ‘erasure’ would be a read that was so uncertain that it is not called as an A, C, G or T. It is distinct from a ‘deletion’ in that the symbol is not simply missed: we are made aware that there is a missing symbol at this position in the DNA string. An ‘insertion’ is a symbol that is read where no symbol should exist. To a molecular biologist and DNA sequencing expert, all of these would be described as read ‘errors’. To them, errors in the information theoretic sense would be called substitutions.
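The four outcomes distinguished above can be made concrete with a small simulation. The following Python sketch is an illustration only: the function name `noisy_read`, the probability parameters and the use of `?` as an erasure marker are assumptions introduced here, and it is not a model of any real sequencing platform. It treats a read as a noisy channel in which each symbol may be substituted (the information theorist's ‘error’), erased, deleted, or preceded by an inserted symbol.

```python
import random

BASES = "ACGT"
ERASURE = "?"  # hypothetical marker: a symbol was present but could not be called


def noisy_read(strand, p_sub=0.0, p_era=0.0, p_del=0.0, p_ins=0.0, rng=None):
    """Return a simulated read of `strand` through a toy noisy channel.

    p_sub: probability a base is read as a different base (substitution)
    p_era: probability a base is flagged as uncallable (erasure)
    p_del: probability a base is silently skipped (deletion)
    p_ins: probability a spurious base is read before a true base (insertion)
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in strand:
        # Insertion: a symbol is read where none should exist.
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            # Deletion: the symbol vanishes and the reader is not told.
            continue
        elif r < p_del + p_era:
            # Erasure: the position is known, its value is not.
            out.append(ERASURE)
        elif r < p_del + p_era + p_sub:
            # Substitution: a wrong symbol is called at the right position.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    print(noisy_read("ACGTACGTACGT", p_sub=0.1, p_era=0.1, p_del=0.1, p_ins=0.1, rng=rng))
```

Note that substitutions and erasures preserve the length of the read, whereas deletions and insertions change it. This is one reason the distinction matters to an information theorist even where a sequencing specialist would call all four ‘errors’.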

A glossary and controlled vocabulary for DNA-storage

DNA-storage has become a popular research field, with a number of interdisciplinary teams forming and collaborating in an attempt to make viable information storage systems that capitalise on DNA’s numerous advantages 11. To alleviate confusion and improve daily communication within and between these groups we propose, and have begun to implement, two measures: a glossary and a controlled vocabulary.

Glossary

We have created a glossary defining basic terms in molecular biology, information theory, computer science and related fields that are relevant to DNA-storage, for those unfamiliar with one or more of these disciplines. This proved to be a useful aid in early discussions within our team and helped to identify areas of nomenclature ambiguity which, if not addressed, might have complicated communication downstream. We have already experienced the advantages of sharing this within our team and with collaborators to facilitate exchange of ideas with them.

Our glossary is held on a cloud storage system, and can be found at https://goo.gl/x6B73Q or https://rebrand.ly/dna-storage-glossary. To allow an open and inclusive discussion of how we might improve communication within this emerging community, we encourage others to critique and contribute to the glossary. The document permits “Suggestions” (proposed edits) and “Comments” to be added, and we will review these regularly and update the document as a resource for our research community.

Controlled vocabulary

Leading on from this, we are developing an evolving controlled vocabulary allowing team members to communicate precisely. This has been particularly beneficial during technical discussions. For instance, to us a ‘data packet’ refers to the part of a DNA sequence that decodes to digital information, and excludes parts that are designed to facilitate DNA sequencing or indexing.

Use of a controlled vocabulary is something that the community may wish to agree upon. For example, one question we pose is: what should we name these DNA sequences that encode digital information? Following the practice of genome scientists, we initially called collections of such DNA sequences ‘libraries’. However, working with such samples caused confusion with our colleagues in a molecular biology laboratory: in a Next Generation Sequencing context, the term ‘library’ is commonly used to describe DNA fragments that have been prepared for DNA sequencing. We now propose to refer to DNA sequences that store digital information as inDNA (for ‘information-carrying DNA’). To refer to inDNA prepared for DNA sequencing, we can now unambiguously talk about a library of inDNA.

We would like to invite others to contribute to the development of a controlled vocabulary so that we might be able to communicate more precisely. We have included a few entries within our glossary document.

Improving review, accessibility and impact of interdisciplinary publications

We now pose another question — how might we improve data description and presentation to increase accessibility and facilitate peer review and reproducibility? Peer review is crucial within the scientific community, but this quality improvement process may not be fully realised in interdisciplinary publications. We have experienced difficulties with peer review of publications related to DNA-storage applications, as authors of work under review, as reviewers ourselves, in our assessment of others’ reviews, and in dealings with journal editors. Often the expertise is not available, or reviewers may only evaluate limited aspects of the paper. The body of work may not be effectively reviewed as a whole, leaving authors without vital feedback and potentially leading to publication of flawed work.

Presentation can be improved by including a short plain-language summary

The concept of standardising presentation of data and methods is not a novel idea in the life sciences, with ‘minimum information’ standards ensuring that publications contain the information necessary to interpret the experimental data. These are typically technique- or study-specific, e.g. MIAME (microarray experiments) 12, MIQE (quantitative polymerase chain reaction) 13 and MIFlowCyt (flow cytometry) 14. Such an approach may not be appropriate to publications relating to DNA-storage applications for some time, as these typically encompass a number of disciplines, each with its own established data description standards and many of which use rapidly changing technologies. It is not appropriate or practical to standardise such a diverse range of technologies and disciplines. Rather we should respect the accepted discipline norms, blending these together to permit DNA-storage standards to evolve.

Even publications that sit predominantly within a single discipline may be of interest to those unfamiliar with that discipline and benefit from the inclusion of a whole-paper plain-language summary. As is standard with plain-language summaries, this should simply report the basic rationale, methodology and main findings. Box 1 is a whole-publication plain-language summary of reference 2 that we have written as an example.

Box 1. Plain-language summary of reference 2.

With the amount of digital information that needs to be stored growing exponentially, there is a need to develop new ways of storing information. High information capacity, longevity and constant improvements in technologies that allow writing, copying and reading make DNA an attractive medium for storing digital information. Here we present a scalable, reliable method for storing digital information in DNA.

The original bytes of several computer files in various formats were encoded into DNA as follows. A Huffman code was used to compress each byte, depending upon occurrence frequency, into a block of 5–6 trits, which are the characters 0, 1 or 2 (just as bits are 0 or 1). A reference table of these blocks and corresponding nucleotide sequences was created, with each block having four possible nucleotide combination representations. Nucleotide combinations were selected depending also upon the previous block, in a manner that prevented the occurrence of any repeating nucleotides (e.g. AA), as these are known to cause downstream copying and reading problems. Following encoding, the digital information was represented as 153,335 DNA sequences of length 117 nucleotides, each containing an index and a simple error checkpoint in addition to encoding part of the original digital information. These DNA sequences were printed as a pool of DNA, containing ~1.2 × 10⁷ copies of each sequence, which was copied via PCR and prepared for reading via DNA sequencing before being decoded (encoding strategy reversed).

Data totalling 739 kilobytes was successfully encoded into DNA, printed, copied, read and decoded with 100% accuracy. A storage density of ~2.2 PB g⁻¹ DNA was achieved.
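The idea in Box 1 of choosing each nucleotide according to what came before, so that no nucleotide is ever repeated, can be illustrated with a short Python sketch. This is a simplified, per-trit illustration under stated assumptions and not the encoding scheme of reference 2 as published: the lookup table `NEXT_BASE`, the function names and the choice of "A" as the notional starting nucleotide are all introduced here for illustration, and the Huffman compression, indexing and error-checking steps are omitted.

```python
# Hypothetical rotating table: for each previous nucleotide, the three
# nucleotides that differ from it, in the order selected by trits 0, 1 and 2.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, previous="A"):
    """Encode a sequence of trits (0, 1 or 2) so that no nucleotide repeats."""
    out = []
    for t in trits:
        # The chosen nucleotide always differs from the one before it.
        previous = NEXT_BASE[previous][t]
        out.append(previous)
    return "".join(out)


def dna_to_trits(dna, previous="A"):
    """Reverse the encoding: recover the trits from the nucleotide sequence."""
    trits = []
    for base in dna:
        trits.append(NEXT_BASE[previous].index(base))
        previous = base
    return trits


if __name__ == "__main__":
    message = [0, 1, 2, 0, 0, 2]
    dna = trits_to_dna(message)
    print(dna)                # CTGTAT
    print(dna_to_trits(dna))  # [0, 1, 2, 0, 0, 2]
```

Because every row of the table excludes the previous nucleotide, runs such as AA cannot occur, whatever the input. Using three symbols per step rather than four is the price paid for this guarantee, which is why the data are first expressed as trits rather than bits.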

It may also be useful to provide a plain-language summary of a specific technical aspect of a publication. For example, a molecular scientist may not understand the details of a complex mathematical algorithm (and nor should the description be altered specifically to allow them to), but an appreciation of how the output impacts aspects of the project relevant to them may be sufficient. We illustrate this using a paragraph from reference 4 (p.5, Methods: Address Design and Encoding). This was read and discussed by the first two co-authors of the present paper, EEH and JS. Figure 1 highlights terms that either EEH, a molecular biologist (purple shading), or JS, an information theorist (yellow shading), found difficult to understand. Joining forces and explaining all terms to each other, they were able to understand the paragraph in depth.

Figure 1. Sample paragraphs from reference 4.

Terms that may not be clear to non-specialists in particular fields are highlighted in purple and yellow, corresponding to those causing problems for a molecular biologist and an information theorist, respectively. (Used under the Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0).

As the interdisciplinary field of DNA-storage evolves towards maturity, there will be an increasing requirement for researchers from different backgrounds to understand publications without having access to colleagues from unfamiliar subject areas. This can be achieved in part by including brief summaries, which may make use of our glossary document, in specialised sections of a publication such that they become accessible for researchers from all disciplines.

Conclusions

We promote the value of interdisciplinary, collaborative science to solve complex problems, including in our field of digital information storage in DNA which combines molecular biology, information theory and computer science. We note the problems that this approach can generate in communication within and between research teams, and propose to reduce these in the DNA-storage area by initiating a glossary and controlled vocabulary. These have been made available to the research community for reference and critique, and we invite contributions to extend their scope.

Acknowledgements

We would like to thank all participants at the IARPA meeting in Washington D.C. on 27–28 April 2016 (https://www.src.org/calendar/e006043/) for an interdisciplinary discussion, during which the need for a unified vocabulary to foster understanding within this new field was in evidence. We thank in particular Luis Ceze who chaired this discussion. This provided additional motivation for continuing and extending the glossary we had already put together, as reported during the meeting, and for writing this paper.

Funding Statement

EEH and JS are supported by the UK's Biotechnology and Biological Sciences Research Council (BBSRC grants BB/L023741/1 and BB/L021994/1). NG is supported by the European Molecular Biology Laboratory.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 1; referees: 2 approved]

References

  • 1. Church GM, Gao Y, Kosuri S: Next-generation digital information storage in DNA. Science. 2012;337(6102):1628. doi: 10.1126/science.1226355
  • 2. Goldman N, Bertone P, Chen S, et al.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature. 2013;494(7435):77–80. doi: 10.1038/nature11875
  • 3. Bornholt J, Lopez R, Carmean DM, et al.: A DNA-based archival storage system. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS ’16, New York, NY, USA, ACM. 2016;44(2):637–649. doi: 10.1145/2980024.2872397
  • 4. Yazdi SM, Yuan Y, Ma J, et al.: A Rewritable, Random-Access DNA-Based Storage System. Sci Rep. 2015;5:14138. doi: 10.1038/srep14138
  • 5. Buschmann T, Bystrykh LV: Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics. 2013;14:272. doi: 10.1186/1471-2105-14-272
  • 6. Langmead B, Trapnell C, Pop M, et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25
  • 7. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324
  • 8. Li R, Yu C, Li Y, et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25(15):1966–1967. doi: 10.1093/bioinformatics/btp336
  • 9. Mardis E, McCombie WR: Library Quantification: Fluorometric Quantitation of Double-Stranded or Single-Stranded DNA Samples Using the Qubit System. Cold Spring Harb Protoc. 2017;2017(6):pdb.prot094730. doi: 10.1101/pdb.prot094730
  • 10. Schumacher B: Quantum coding. Phys Rev A. 1995;51(4):2738–2747. doi: 10.1103/PhysRevA.51.2738
  • 11. Zhirnov V, Zadegan RM, Sandhu GS, et al.: Nucleic acid memory. Nat Mater. 2016;15(4):366–370. doi: 10.1038/nmat4594
  • 12. Brazma A, Hingamp P, Quackenbush J, et al.: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001;29(4):365–371. doi: 10.1038/ng1201-365
  • 13. Bustin SA, Benes V, Garson JA, et al.: The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin Chem. 2009;55(4):611–622. doi: 10.1373/clinchem.2008.112797
  • 14. Lee JA, Spidlen J, Boyce K, et al.: MIFlowCyt: the minimum information about a Flow Cytometry Experiment. Cytometry A. 2008;73(10):926–930. doi: 10.1002/cyto.a.20623
F1000Res. 2018 Mar 29. doi: 10.5256/f1000research.14640.r31971

Referee response for version 1

Jeffrey R Sampson 1

The paper by Hesketh et al. addresses the very important issue of facilitating productive communication among highly interdisciplinary teams. This impacts not only verbal communication among interdisciplinary members but also written communications in the form of simple messages and publications. It is also well noted that during peer review of publications, a single person with the necessary vocabulary and domain knowledge to fully understand, evaluate and communicate a review of the work is often lacking. The method of Hesketh et al. will clearly aid in this important process. Importantly, they have developed a smart approach to the problem that can be applied more broadly to other interdisciplinary teams that require the integration of disparate fields of science and technology such as life sciences and engineering. For example, the synthetic biology community has experienced this issue as it has developed and evolved over the past 15 or so years.

More specifically, Hesketh et al. not only set out a good structure and context for the challenges that the interdisciplinary team developing DNA as a digital information storage medium faces, but also provide some solutions to critical problems. The first is creating a glossary of terms so that all disciplines involved can communicate with a common and known set of terms. Second, they have put forward the use of a “controlled vocabulary” where terms that are particular to the emerging interdisciplinary field are defined so as to enable all members to communicate precisely and thus reduce confusion that often occurs when terms have multiple meanings and/or field dependent meanings. Perhaps most importantly, Hesketh et al. have built their approach as a “living document” where the vocabulary and common vocabulary can be continuously updated by the interdisciplinary community as the community grows and evolves.

With respect to any additional comments or edits, I suggest that the authors consider adding “Chemistry Terminology” to their glossary with specific attention to the chemical synthesis of DNA, since this is the current method for DNA synthesis. Such terms could include: phosphoramidite, cycle yield, coupling efficiency, de-block step, oxidation step.

Given the importance, clarity and potential for broad applicability, I strongly recommend the paper for indexing.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2018 Feb 26. doi: 10.5256/f1000research.14640.r29657

Referee response for version 1

Robert Grass 1

The paper by E. Hesketh addresses very important problems of our current scientific landscape, and the ongoing movement to more interdisciplinary approaches:

  • Communication between scientists in a team

  • Peer Review

The authors discuss these two topics using a currently evolving research topic: the storage of digital information in DNA; but the addressed problems have a significantly broader applicability, as individual research topics spread over more and more scientific disciplines, and especially because data and computer sciences are having a major impact on science (and the corresponding high-level mathematics are currently not integrated into e.g. life-science curricula).

For communication between scientists within a team, the authors present an excellent glossary of terms for the scientific fields involved in DNA data storage, and the development and open publication/distribution of such glossaries would bring benefit to many interdisciplinary projects. Instead of a locally managed glossary (as proposed), a more open approach (e.g. an open Wikipedia) would be even more beneficial and would further motivate others to participate more strongly in updating the glossary. Some referencing within the glossary would also be valuable, as background is often required to understand an individual term (as is standard within Wikipedia). If the authors have good reasons for a non-public (i.e. non-wiki) approach, these should be discussed in the article; if not, the generation of a corresponding wiki would certainly be highly appreciated by the research community.

However, to completely solve the communication problems and misunderstandings in such projects, the authors touch on a point of even higher importance: “misunderstandings can pass unnoticed”. The question is what solutions are available to make team members aware of the danger of miscommunication and to ensure that every individual in a given project invests sufficient effort to learn the details, wordings and backgrounds of the neighboring fields. The authors may want to build further on this observation, and potentially present approaches to ensure such awareness and openness (especially in teams involving specialists).

The second problem of interdisciplinary projects addressed is peer-review. The more detailed background of different scientific fields is required to judge the correctness of scientific work performed, the more difficult it is to find individuals as paper referees who cover all of this knowledge. A plain text summary, as presented by the authors as part of a solution is certainly a good start, but probably does not go far enough. In contrast to individuals working on an interdisciplinary project (as above), a journal referee does not have enough time to learn details and wordings of the other fields, and the review process gets somewhat superficial. A general understanding of the overall goals of a given paper (as per plain text summary) may help the referee to understand the article scope, but it will not help him to judge the scientific validity of the methods applied. The authors of the present manuscript somewhat touch on this, and a more explicit depiction of the problem may be valuable to a further discussion of future publishing/peer-review modes (e.g. post-publication review, open-review, various referees only refereeing part of articles).

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


