Abstract
Genome-scale engineering holds great potential to impact science, industry, medicine, and society, and recent improvements in DNA synthesis have enabled the manipulation of megabase genomes. However, coordinating and integrating the workflows and large teams necessary for gigabase genome engineering remains a considerable challenge. We examine this issue and recommend a path forward by: 1) adopting and extending existing representations for designs, assembly plans, samples, data, and workflows; 2) developing new technologies for data curation and quality control; 3) conducting fundamental research on genome-scale modeling and design; and 4) developing new legal and contractual infrastructure to facilitate collaboration.
Subject terms: Synthetic biology, Genomic engineering
Genome-scale engineering requires the integration of a wide range of in silico and in vivo technologies, as well as data management procedures and legal infrastructure. Here the authors provide a list of recommendations to address these challenges.
Introduction
Engineering the entire genome of an organism promises to enable large-scale changes to its organization, function, and interactions with its environment, with broad potential for impacts across science, industry, medicine, and society1. The past several decades have seen remarkable progress in our capability to synthesize DNA and modify genomes2–4. Since Khorana created the first synthetic gene 40 years ago5, our capability to construct DNA sequences has doubled approximately every 3 years (Fig. 1a), progressing from plasmids in the early 1990s6,7, viruses in the early 2000s8, and gene clusters in the mid-2000s9,10, to the first bacterial chromosome in 200811,12. Recently, several groups have re-engineered the 4 Mb genomes of Escherichia coli13,14 and Salmonella typhimurium15, and the Synthetic Yeast (Sc 2.0) project16,17 has nearly completed re-engineering an 11.4 Mb genome for Saccharomyces cerevisiae18. Looking ahead, in 2016 leaders from academia and industry formed Genome Project-Write1 to initiate the engineering of the gigabase genomes of higher-order eukaryotes. The goals of the GP-Write consortium include engineering a virus-resistant, ultra-safe human-derived cell line for pharmaceutical production19.
From engineering genes to engineering genomes
Moving to the gigabase scale poses major technological and scientific challenges. Challenges related to DNA synthesis and editing have been discussed extensively in the literature20–23. Significant attention has also been devoted to the challenges of modeling24,25, designing17,26,27, and testing28 genomes. Less attention, however, has been devoted to the technologies, repositories, standards, and other resources needed to integrate these tasks into a cohesive workflow.
We contend that workflow integration is a first-class problem for gigabase-scale genome engineering. Over the last 40 years, the number of authors of pioneering genome engineering projects has risen markedly with genome size, suggesting that the complexity of genome engineering is also scaling with the size of the genome (Fig. 1b). If these trends continue, engineering a gigabase genome would be projected to become possible in ~2050 and require a team with the capabilities of around 500 investigators. To manage projects of such complexity without massive teams, we advocate for the development of an ecosystem of tools, services, automation, and other resources, which could enable a modestly sized team of bioengineers to indirectly access the equivalent capabilities of hundreds of people. To this end, we have examined the emerging design–build–test–learn workflow for genome engineering, identifying key interfaces and making recommendations for the adoption or development of technologies, repositories, standards, and frameworks.
Table 1.
Year | DNA size (bp) | Collaboration size (# authors) | Reference | Notes |
---|---|---|---|---|
1979 | 207 | 1 | Khorana5 | First synthetic gene |
1990 | 2050 | 4 | Mandecki et al.6 | First synthetic plasmid |
1995 | 2700 | 5 | Stemmer et al.7 | Synthetic plasmid |
2002 | 7.5000E+03 | 3 | Cello et al.8 | Polio virus cDNA |
2004 | 1.4600E+04 | 7 | Tian et al.9 | rRNA genes |
2004 | 3.1656E+04 | 6 | Kodumal et al.10 | Gene cluster |
2008 | 5.8297E+05 | 17 | Gibson et al.11 | Mycoplasma genitalium |
2010 | 5.3100E+05 | 24 | Gibson et al.12 | Mycoplasma mycoides, JCVI synthetic cell |
2011 | 9.1010E+04 | 15 | Dymond et al.16 | Sc 2.0 synIXR |
2014 | 2.7287E+05 | 80 | Annaluru et al.17 | Yeast chromosome synIII |
2016 | 3.9700E+06 | 21 | Ostrov et al.13 | Partially recoded E. coli, 62 K edits in genome |
2017 | 2.0000E+05 | 13 | Lau et al.15 | Salmonella typhimurium partial genome |
2019 | 4.0000E+06 | 14 | Fredens et al.14 | Recoded E. coli |
2020 | 1.14E+07 | 172 | Richardson et al.18 | Sc 2.0 estimated completion date; Genome size from Table 3 in reference; Collaboration size estimated from Sc 2.0 website |
These data are plotted in Fig. 1
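The doubling-time trend and the gigabase projection discussed in the text can be roughly reproduced from Table 1. The sketch below fits an unweighted least-squares line to log2(DNA size) against year using every row of the table. That choice of fit is an assumption made here for illustration; the fit underlying Fig. 1 may use a different subset or weighting, so the projected year need not match the ~2050 estimate exactly.

```python
import math

# (year, DNA size in bp) pairs copied from Table 1.
TABLE_1 = [
    (1979, 207), (1990, 2050), (1995, 2700), (2002, 7.5e3),
    (2004, 1.46e4), (2004, 3.1656e4), (2008, 5.8297e5), (2010, 5.31e5),
    (2011, 9.101e4), (2014, 2.7287e5), (2016, 3.97e6), (2017, 2.0e5),
    (2019, 4.0e6), (2020, 1.14e7),
]


def fit_log2_trend(points):
    """Ordinary least-squares fit of log2(size) against year.

    Returns (slope, intercept), where slope is in doublings per year.
    """
    xs = [year for year, _ in points]
    ys = [math.log2(size) for _, size in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def projected_year(target_bp, slope, intercept):
    """Year at which the fitted trend line reaches target_bp."""
    return (math.log2(target_bp) - intercept) / slope


slope, intercept = fit_log2_trend(TABLE_1)
doubling_time_years = 1.0 / slope
gigabase_year = projected_year(1e9, slope, intercept)
print(f"doubling time: {doubling_time_years:.1f} years")
print(f"trend reaches 1 Gb around {gigabase_year:.0f}")
```

Because the fit includes projects that were well below the size frontier in their year (for example, the 2017 partial genome), it is sensitive to which rows are treated as representative.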
An emerging workflow for genome engineering
Recently, a number of groups have proposed or developed workflows for organism engineering3,18,27–32, converging toward a common engineering cycle consisting of the four stages shown in Fig. 2. These stages are (1) Design: bioengineers use models and design heuristics to specify a genome with an intended phenotype; (2) Build: genetic engineers construct the desired DNA sequence in a target organism; (3) Test: experimentalists assay molecular and behavioral phenotypes of the engineered organism; (4) Learn: modelers analyze the discrepancies between the desired and observed phenotypes to develop improved models and design heuristics. The process is repeated until an organism with the desired phenotype is identified. This incremental approach enables engineering despite our incomplete understanding of the complexities of biology.
The inner loop in Fig. 2 indicates the workflow used by many current genome engineering projects, which have primarily focused on “top-down” refactoring of existing genomes, e.g., by rewriting codons or reducing genomes to essential sequences. In the longer term, one of the key aims of synthetic biology is to engineer organisms that have novel phenotypes by “bottom-up” assembly of modular parts and devices33. At a much smaller scale, organism engineers are already beginning to use this approach to engineer novel metabolic pathways for commercial production of high-value chemicals34–36. For gigabase genome engineering, this approach will likely require more complex workflows that utilize more sophisticated design tools, phenotypic assays, data analytics, and models (outer loop of Fig. 2).
Executing these multistep workflows requires exchanging a wide range of materials, information, and other resources between numerous tools, people, institutions, and repositories. The design phase must communicate genome designs to the build phase, the build phase must deliver DNA constructs and cell lines to the test phase, the test phase must transmit measurements to the learn phase, the learn phase must provide models and design heuristics to the design phase, and workflows must be applied to coordinate the interaction and execution of tools across all of these stages.
In addition to these technical challenges, genome engineering must also address a number of safety, security, legal, contractual, and ethical issues. Throughout genome engineering workflows, bioengineers must pay careful attention to biosafety, biosecurity, and cybersecurity. To execute genome engineering workflows across multiple institutions, bioengineers must navigate materials transfer agreements, copyrights, patents, and licenses.
Every aspect of this genome engineering workflow must be scaled up to handle gigabase genomes. Ultimately, much or all of each step should be automated, and each interface between steps should be formalized to facilitate machine reasoning, removing the ad hoc and human-centric aspects of genome engineering as much as possible. In many cases, this can be facilitated by adopting or extending solutions from smaller-scale genome engineering, as well as solutions from related fields such as systems biology, genomics, genetics, bioinformatics, software engineering, database engineering, and high-performance computing. Other challenges of gigabase genome engineering, however, are likely to require the development of novel systems or additional fundamental research.
Identifying and closing gaps in the state of the art
In this section, we discuss the integration challenges identified in the previous section, reviewing the state of the art in technologies and standards with respect to the emerging needs of gigabase genome engineering. Instead of focusing on specific evolving protocols and methods, which are likely to advance rapidly, we consider the information that must be communicated to enable protocols or methods to be composed into a comprehensive workflow. Through this analysis, we identify critical gaps and opportunities, where additional technologies and standards would facilitate workflows that can effectively deliver gigabase engineered genomes. Table 2 summarizes the potential solutions that we have identified, which are detailed in the following subsections.
Table 2.
For each interface in the emerging workflow, our recommendations fall into one of three categories: adopt or extend relatively mature existing methods (green), develop new solutions or expand nascent methods (yellow), and conduct additional fundamental research (red)
Genome refactoring and design
Current genome engineering projects have focused primarily on refactoring genomes while preserving their cellular function. For example, three recent projects have involved eliminating nonessential elements27, reordering genes17, and inserting metabolic pathways37. At this level, two critical challenges for scaling are accessing well-annotated source genomes and representing and exchanging designs for modified genomes. More complex changes of organism function will pose additional challenges related to composing parts to produce novel cellular functions.
Currently, genome design generally involves modifying pre-existing organism sequences, such as those available in the public archives of the International Nucleotide Sequence Database Collaboration (INSDC)38, which currently contains ~ bacterial genomes and hundreds of eukaryotic genomes39–43. Functional annotation is key, as genome engineers will need to consider tissue-specific expression patterns, regulatory elements, structural elements, replication origins, clinically significant sites of DNA recombination and instability, etc. The consistency of annotations is a key challenge, as many genomes have been annotated by different toolchains that produce significantly different annotations. For example, the human reference genomes generated by the RefSeq and GENCODE projects have notable differences44,45 with likely engineering consequences, such as for the ability to predict loss-of-function from interaction with alternative splicing. Much of this knowledge is also dispersed among different resources, though annotations can be integrated with the aid of services such as NCBI Genome Viewer46, WebGestalt47, and DAVID48. For moving to the gigabase scale, improved annotation APIs will be valuable, as would estimates of the confidence and reliability of annotations, such as those the RefSeq database provides with the Evidence and Conclusion Ontology49.
The gigabase scale poses challenges for the representation and exchange of genome designs as well. Common formats such as GenBank and EMBL are monolithic in their treatment of sequences, which makes it difficult to integrate or harmonize editing across multiple concurrent users, and can even cause difficulties in simply transferring the data. Two formats better suited for genome engineering are the Generic Feature Format (GFF) version 3 and the Synthetic Biology Open Language (SBOL) version 250. GFF3 allows hierarchical organization of sequence descriptions (e.g., genes may be organized into clusters, and clusters into chromosomes), uses the Sequence Ontology51 to annotate sequences, and has already been used in the Sc 2.0 genome engineering project18. SBOL 2 is also routinely used for hierarchical description of edited genomes52 and can interoperate with GFF3 (though GFF3 only represents a subset of SBOL)53. SBOL provides a richer design-centric language, including support for variants, libraries, and partial designs (e.g., identifying genes in a cluster, but not yet particular variants or cluster arrangement), as well as for other elements and cellular functions (e.g., proteins, metabolic pathways, regulatory interactions). SBOL also interoperates with models encoded in the Systems Biology Markup Language (SBML)54,55. Both GFF3 and SBOL, however, would benefit from more stable specifications of sequence positions within chromosomes, as a sequence index is fragile to changes and sequence uncertainties. SBOL supports (and GFF3 could be extended to support) expression of nonstandard bases and sequence modifications in an enhanced sequence encoding language such as BpForms56.
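To make the hierarchical organization in GFF3 concrete, the following minimal sketch reads the nine tab-separated GFF3 columns and uses the ID and Parent attributes to rebuild a chromosome, cluster, and gene hierarchy. The records, identifiers, coordinates, and feature types are hypothetical and chosen only for illustration; a production parser would also need to handle escaping, multi-line features, and embedded sequence sections.

```python
from collections import defaultdict

# Hypothetical GFF3 records (tab-separated, nine columns); the IDs,
# types, and coordinates are invented for illustration only.
GFF3_TEXT = "\n".join([
    "##gff-version 3",
    "chrI\texample\tchromosome\t1\t5000\t.\t.\t.\tID=chrI",
    "chrI\texample\tgene_group\t100\t3000\t.\t+\t.\tID=cluster1;Parent=chrI",
    "chrI\texample\tgene\t100\t900\t.\t+\t.\tID=geneA;Parent=cluster1",
    "chrI\texample\tgene\t1200\t3000\t.\t+\t.\tID=geneB;Parent=cluster1",
])


def parse_attributes(column9):
    """Split the ninth GFF3 column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def build_hierarchy(gff3_text):
    """Return (features by ID, child IDs grouped by parent ID)."""
    features = {}
    children = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}")
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID")
        if feature_id is None:
            continue  # this sketch only tracks features that carry an ID
        features[feature_id] = {
            "type": cols[2], "start": int(cols[3]), "end": int(cols[4]),
        }
        # A feature may list several parents, separated by commas.
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            children[parent].append(feature_id)
    return features, dict(children)


def render(node, features, children, depth=0):
    """Yield indented lines describing the subtree rooted at node."""
    feature = features[node]
    yield f"{'  ' * depth}{feature['type']} {node} [{feature['start']}-{feature['end']}]"
    for child in children.get(node, []):
        yield from render(child, features, children, depth + 1)


features, children = build_hierarchy(GFF3_TEXT)
tree_lines = list(render("chrI", features, children))
print("\n".join(tree_lines))
```

Note that the coordinates here are plain integer offsets, which illustrates the fragility mentioned above: inserting a fragment upstream would shift every downstream start and end value.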
Representations of genome designs also need to express design constraints and policies, such as removal of restriction sites, separation of overlapping features, replacement of codons, and optimization for DNA synthesis. Projects such as Sc 2.0 have implemented this with a combination of guidelines for human hand-editing and custom software tools, and DNA synthesis providers provide interfaces to check for manufacturability constraints. At the gigabase scale, however, it will be beneficial to adopt more powerful and expressive languages for describing design policies, such as rule-based ontologies57,58, and to include assembly and transformation plans in design representations to simplify adjustments for manufacturability. JGI’s BOOST tool provides a prototype in this direction59. SBOL is well-suited for this task, though GenBank and GFF3 could also, at least in principle, be extended to encode such information.
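One of the simplest design policies named above, removal of restriction sites, can be expressed as an automated check. The sketch below scans both strands of a design for the recognition sequences of BsaI (GGTCTC) and BsmBI (CGTCTC), two Type IIS enzymes often used in assembly. Which sites are forbidden is a project-specific policy, so the `FORBIDDEN_SITES` table and the example sequence are assumptions for illustration, not the rule set of Sc 2.0 or any other project.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Recognition sequences of two Type IIS enzymes commonly used in assembly.
# Which sites a given project forbids is a project-specific policy choice.
FORBIDDEN_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_forbidden_sites(seq, sites=FORBIDDEN_SITES):
    """Return sorted (position, enzyme, strand) hits.

    Positions are 0-based indices on the forward strand.
    """
    seq = seq.upper()
    hits = []
    for enzyme, site in sites.items():
        patterns = {"+": site}
        rc = reverse_complement(site)
        if rc != site:  # palindromic sites need only one scan
            patterns["-"] = rc
        for strand, pattern in patterns.items():
            start = seq.find(pattern)
            while start != -1:
                hits.append((start, enzyme, strand))
                start = seq.find(pattern, start + 1)
    return sorted(hits)


# Hypothetical design fragment containing three forbidden sites.
design = "ATGGGTCTCAAAGAGACCTTTCGTCTCTAA"
for position, enzyme, strand in find_forbidden_sites(design):
    print(f"{enzyme} site at {position} on {strand} strand")
```

A check like this reports violations only; choosing a synonymous replacement that also respects codon and overlap constraints is the harder problem that more expressive rule languages aim to address.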
Modeling will become increasingly important as genome engineering moves beyond refactoring and recoding into more complex changes to an organism’s function. Genome-scale metabolic models60,61 and whole-cell models62 can be constructed by combining biochemical and genomic information from multiple databases, such as BioCyc63 and the SEED64. Models will also need to predict the behavior of organisms that are composed of separately characterized genetic parts, devices, pathways, and genome fragments. Substantial fundamental research still needs to be conducted to make such models practical at the gigabase scale.
Building engineered genomes
Technology and protocols for building engineered genomes are advancing rapidly, with potential paths to the gigabase scale discussed, for example, in ref. 1 and ref. 23. Depending on the specific host and intended function of the engineered organism, there are numerous potential approaches and protocols for DNA synthesis, assembly, and delivery. Currently, there is an unmet need for guidance on best practices for measuring, tracking, and sharing information regarding engineered genomes and intermediate samples.
Manipulating DNA during assembly offers ample opportunities for reduced yield, breakage, error, and other sources of uncertainty in achieving the designed DNA sequence. Protocols and commercial kits to assemble shorter DNA fragments into larger constructs often involve amplification, handling, purification, transformation, or other storage and delivery steps that can increase uncertainty in the quality and quantity of the DNA. Assembled DNA may also include added sequences that are not biologically active, as is the case for some methods using restriction enzymes, or scars, such as may occur with Golden Gate Assembly65 or MoClo66. Gibson Assembly67 is scarless, but the yield and specific results may depend on the secondary structure of the DNA fragments. Thus, in addition to sequence information, workflows will likely need extended representations that can also track the full range of information likely to affect assembly products, including DNA secondary structure, assembly method, sequences required for assembly and their location along the DNA molecule (e.g., landing pads or sequences for compatibility with protocol-hosting strains of E. coli or yeast), and intended epigenetic modifications. The results verifying both intermediate and final sequence construction are typically produced in the FASTQ format68, which is generally sufficient for smaller constructs. To operate on large-scale genomes, however, more comprehensive descriptions of a genome and its variations may be made with representations such as GVF69 or SBOL70.
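As a small illustration of what FASTQ verification data carries, the sketch below parses four-line FASTQ records and summarizes per-read quality. It assumes the common Phred+33 quality encoding and unwrapped records; the read names and sequences are hypothetical. It deliberately shows the limitation noted above: each record describes one read, with no link to the design, the assembly method, or the sample it came from.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from an iterable of lines.

    Assumes four-line records with Phred+33 encoded qualities.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError(f"truncated FASTQ record near {header!r}")
        seq, plus, qual = (line.rstrip("\n") for line in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record {header!r}")
        read_id = header[1:].split()[0]
        yield read_id, seq, [ord(char) - 33 for char in qual]


def mean_quality(scores):
    """Arithmetic mean of the Phred scores of one read."""
    return sum(scores) / len(scores)


# Hypothetical reads for illustration only.
EXAMPLE = """@read1 sample=construct_7
ACGT
+
IIII
@read2
GGCA
+
!!II
"""

summary = {
    read_id: mean_quality(scores)
    for read_id, _, scores in read_fastq(EXAMPLE.splitlines())
}
print(summary)
```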
Suitable options for the delivery of large, assembled DNA constructs and whole genomes are generally lacking. The yield of existing processes, such as electrical and chemical transformation or genome transplantation, could be improved significantly to increase their utility, and a broader range of approaches should be developed for use with any organism and cell type. This may also require identifying new cell-free environments or cell-based chassis for assembling and manipulating DNA that also have compatibility with genome packaging and delivery systems into host organisms. To facilitate such development, delivery protocols and their associated information regarding number of biological and technical replicate experiments, methods, measurements, etc. should be available in a machine-readable format. This should include information regarding the host cell, such as its genotype, which is often not fully verified. The adoption of best practices from industrial biomanufacturing settings and implementation of laboratory information management systems (LIMS) could provide a path forward toward integrating appropriate measurements, process controls, and information handling, as well as the tracking and exchange of samples. Advancing the use of automation to support the build step of the genome engineering workflow requires evaluating which steps may reduce costs and speed results, the availability of automated methods, ways to effectively share those methods and adapt them across platforms and manufacturers, and ways to more simply integrate and tune automated workflows.
Testing the function of engineered genomes
Strain fitness and other phenotypes can be assessed via a wide range of biochemical and omics measurements, the details of which are beyond the scope of this discussion. In all cases, however, collaborating organizations will need to agree on specific measurements, along with control and calibration measurements, to ensure that the results can be compared and used across the participating laboratories.
DNA constructs are often evaluated for their associated growth phenotypes to determine the nature and extent of unexpected consequences for cell function and fitness due to the revised genome sequence. Engineered cell lines should also be evaluated for robustness to changes in the environmental context that the cells are likely to experience during typical use in the intended application, as well as for stability against evolution or adaptation over relevant timescales. This is complicated by the need for shared definitions and measurements for fitness, metabolic burden, and other phenotypic properties.
Standard protocols, reference cell lines, and the use of experimental design are examples of tools available to increase the rigor and confidence in conclusions that can be drawn from testing. It will likely also be useful to develop standards and measurement assurance for testing engineered genomes. Such foundations can be used to help identify relationships between genotype and phenotype or determine the contributions of biological stochasticity and measurement uncertainty to the overall variability in a measured trait, though comprehensive methods of this sort are likely to require significant fundamental research.
Calibration of biological assays aids in comparing results both within a single laboratory and across different laboratories. Recent studies, for example, for fluorescence71,72, absorbance73, and RNAseq74 measurements, demonstrate the possibility of realizing scalable and cost-effective comparability in biological measurements. Organism engineering is likely to be facilitated by the development of additional calibrated measurement methods and absolute quantitation of an organism’s properties.
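The core arithmetic behind this kind of calibration can be sketched briefly. The example below fits a scale factor from a dilution series of a calibrant with known amounts, then converts a raw instrument reading into calibrant-equivalent units. All values are hypothetical, and the model assumes a linear, blank-corrected response with no saturation, which a real protocol would need to verify; it is not the specific procedure of the cited studies.

```python
def fit_scale_through_origin(known_amounts, blank_corrected_readings):
    """Least-squares slope for the model reading = slope * amount."""
    numerator = sum(
        amount * reading
        for amount, reading in zip(known_amounts, blank_corrected_readings)
    )
    denominator = sum(amount * amount for amount in known_amounts)
    if denominator == 0:
        raise ValueError("calibrant series needs at least one nonzero amount")
    return numerator / denominator


def to_calibrated_units(raw_reading, blank, slope):
    """Convert a raw reading to calibrant-equivalent units."""
    return (raw_reading - blank) / slope


# Hypothetical plate-reader values in arbitrary units (a.u.).
blank = 50.0
calibrant_uM = [0.0, 1.0, 2.0, 4.0]        # known calibrant concentrations
raw_readings = [50.0, 250.0, 450.0, 850.0]  # instrument output for each

corrected = [reading - blank for reading in raw_readings]
slope = fit_scale_through_origin(calibrant_uM, corrected)  # a.u. per uM
sample_uM = to_calibrated_units(650.0, blank, slope)
print(f"scale factor: {slope:.1f} a.u. per uM")
print(f"sample: {sample_uM:.2f} uM calibrant equivalent")
```

Two laboratories whose instruments report different arbitrary units can then compare results in the shared calibrant-equivalent unit, provided both run the same calibrant series.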
Establishing shared representations and practices for metadata, process controls, and calibration will also be critical. Automation-assisted integration and comparison of the data, metadata, process controls, and calibration across laboratories will facilitate both the testing process and learning through modeling and simulation. Some existing ontologies can be leveraged for this purpose, such as the Experimental Conditions Ontology75 (XCO), the Experimental Factor Ontology76 (EFO), and the Measurement Method Ontology75 (MMO). In addition, appropriate LIMS tooling and curation assistance software (e.g., RightField77) will be vital for enabling such metadata to be created consistently, correctly, and in a timely fashion, by limiting the required input from human investigators.
Learning systematically from test results
As genome engineering affects systems throughout an organism, comprehensive models are needed that can help to both predict and interpret the relationship between genotype and phenotype. Although some models have been constructed for a whole cell62 or whole organism78, developing and tuning such models is extremely challenging. To scale to gigabase genomes, it will be valuable to develop improved capabilities for creating, calibrating, and verifying models.
The first challenge in learning from the data is discovering and marshaling the data needed. Partial solutions exist, such as the workflow model introduced in SBOL 2.250, and ontologies such as the Open Biological and Biomedical Ontology79, the Experimental Factor Ontology76, the Systems Biology Ontology80, and phenotype ontologies81,82. These will need to be integrated and extended to cover the full range of needs for genome engineering.
Automation-assisted generation and verification of models at scale, however, still have many open fundamental research challenges, including addressing the combinatorial complexity of biology and the multiple scales between genomes and organismal behavior, high-performance simulation of large models, model verification, and representation of model semantic meaning and provenance24,25.
Until we have comprehensive predictive models, engineers will likely rely on ad hoc combinations of predictive models of parts of organisms, data-driven models, and heuristic design rules. For example, constraint-based models are often used in metabolic engineering34, PSORTb83 can be used to help target proteins to specific compartments, and GC-content optimization can be used to improve host compatibility84. Gigabase-scale genome engineering will require applying many such models simultaneously, and thus will benefit from adopting existing standard formats designed to facilitate biological model sharing and composition, such as SBML85, CellML86, NeuroML87, and other standards in the Computational Modeling in Biology Network (COMBINE)88. Large numbers of models in these formats can already be found in public databases, such as BioModels89, the NeuroML database90, Open Source Brain91, and the Physiome Model Repository92. Similarly, repositories such as Kipoi93 and the DockerHub repository94 can already be used to share data-driven models. Further extensions to such formats, however, will be valuable for automating the learning process, including associating semantic meaning with model components, capturing the provenance of model elements (e.g., data sources, assumptions, and design motivations), and capturing information about their predictive capabilities and applicable scope.
To increase automation in learning such models from data, it will likely be valuable to develop new repositories of models of individual biological parts that can be composed into models of entire organisms95,96; new methods for generating model variants that explain new observations by incorporating models of additional parts, alternative kinetic laws, or alternative parameter values; and new model selection techniques for nonlinear multiscale models97.
Coordination and sharing in complex workflows
Tasks in isolation are not enough: efficient operation of the design–build–test–learn cycle for engineering gigabase genomes will require coordinating all of the numerous heterogeneous tasks discussed into clear, cohesive, reproducible workflows98,99 for software interactions, for laboratory protocols, and for management of tasks and personnel. Automating workflows also provides opportunities to implement best practices for cybersecurity, cyberbiosecurity, and biosecurity.
For integrating informational tasks, computational workflow engines enable specification, reproducible execution, and exchange of complex workflows involving multiple software programs and computing environments. Current workflow tools include general-purpose tools, such as the Common Workflow Language (CWL)100, the Dockstore101 and MyExperiment102 sharing environments, and the PROV ontology for tracking information provenance103 (which is already being applied to link design–build–test–learn cycles in SBOL50). There are also a number of bioinformatics-focused engines, including Cromwell104, Galaxy105, NextFlow106, and Toil107. These can be readily adopted for gigabase engineering through steps such as including CWL files in COMBINE archives108, developing REST or other programmatic interfaces for databases used in genome engineering, containerization109 of genome engineering computational tools, and depositing these containers to a registry such as DockerHub94. Other enhancements likely to be useful include the development of graphical workflow tools for genome engineering, an ontology for annotating the semantic meaning of workflow tasks, and the application of issue tracking systems, such as GitHub issues110 or Jira111, to help coordinate teams on the complex tasks involved in designing genomes that require human intervention.
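The essential service a workflow engine provides, running tasks in dependency order while recording what each task consumed, can be sketched with the Python standard library. The task names and the placeholder actions below are hypothetical, and the sketch covers only a single pass through the design–build–test–learn cycle; it is not a substitute for CWL or any of the engines named above, which also handle containers, scheduling, and failure recovery.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names for one pass through the cycle.
# Each task maps to the set of tasks whose outputs it consumes.
DEPENDENCIES = {
    "design_genome": set(),
    "check_manufacturability": {"design_genome"},
    "assemble_construct": {"check_manufacturability"},
    "sequence_verify": {"assemble_construct"},
    "phenotype_assay": {"sequence_verify"},
    "update_models": {"design_genome", "phenotype_assay"},
}


def placeholder_action(task):
    """Stand-in for a real tool invocation; returns a labeled string."""
    def action(inputs):
        return f"{task}({', '.join(sorted(inputs))})"
    return action


def run_workflow(dependencies, actions):
    """Run tasks in dependency order and log which inputs each one used."""
    order = list(TopologicalSorter(dependencies).static_order())
    results, provenance = {}, []
    for task in order:
        inputs = {dep: results[dep] for dep in dependencies.get(task, ())}
        results[task] = actions[task](inputs)
        provenance.append({"task": task, "used": sorted(inputs)})
    return order, results, provenance


ACTIONS = {task: placeholder_action(task) for task in DEPENDENCIES}
order, results, provenance = run_workflow(DEPENDENCIES, ACTIONS)
for entry in provenance:
    print(entry["task"], "<-", entry["used"])
```

The `provenance` list plays the role that PROV-style records play in a full system: it states which upstream outputs were used to produce each result, which is the information later needed for licensing and reproducibility checks.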
For experimental protocols, a number of technologies have already been developed to automate and integrate experimental workflows as well. Laboratory automation systems can greatly improve both reproducibility and efficiency112 and can also be integrated with LIMS113 to help track workflows and reagent stocks. A number of automation languages and systems have been developed, including Aquarium114, Antha115, and Autoprotocol116. Although these have not been widely adopted, they have been successfully applied to genetic engineering (e.g., ref. 117), and gigabase-scale genome engineering would benefit from standardization and integration of such systems for application to build and test protocols.
Once links are established across different portions of a workflow, unified access to information in databases for various institutions and stages of the workflow can be accomplished using standard federation methods and any of the various mature open tools for database management systems (DBMS). Scalable sharing would be further enhanced by adoption of the FAIR (findable, accessible, interoperable, reusable) data management principles118, which put specific emphasis on automation friendliness of data sharing. Repositories that support these principles and are applicable to genome engineering include FAIRDOMHub119, Experimental Data Depot (EDD)120, and SynBioHub121.
Contracts, intellectual property, and laws
Large-scale genome engineering also poses novel challenges in coordinating legal and contractual interactions. When using digital information, both humans and machines need to know the accompanying copyright and licensing obligations. Systematic licensing regimes have been developed for software by the Open Source Initiative (OSI) and other software organizations122 and for media and other content with the Creative Commons (CC) family of licenses123, both of which readily allow either a user or a machine to determine if a digital object can be reused, if its reuse is prohibited, or if more complicated negotiation or determination is required. Such systems can be applied to much of the digital information in genome engineering. Care will need to be taken, however, regarding sensitive personal information and European Union database protection rights, which these do not address.
Transfer of physical biological materials was first standardized in 1995 with NIH’s Uniform Biological Materials Transfer Agreement (UBMTA), which is used extensively by organizations such as Addgene. Broader and more compatible systems have been developed in the form of the Science Commons project124 and the OpenMTA125. There are still significant open problems regarding compliance with local regulatory and legal systems, however, particularly when materials cross international borders. Moreover, material transfer agreements generally do not address the intellectual property for materials, which is typically governed through patent law. No publicly available system yet supports automation for patent licensing. Development of automation-friendly intellectual property management might be supported by defining tiered levels that are simultaneously intelligible for the common user, legal experts, and computer systems—though establishing which material or usages can be classified into which tiers may be a difficult process of legal interpretation. Effective use in automation-assisted workflows will also require recording information about which inputs are involved in the production of results, using mechanisms such as the PROV ontology103.
Finally, organizations will also need to manage the level of exposure of information, whether due to issues of privacy, safety, publication priority, or other similar concerns. Again, no current system exists, but a basis for developing one may be found in the cross-domain information sharing protocols that have been developed in other domains126,127.
Recommendations and outlook
In summary, scaling up to gigabase genomes presents a wide range of challenges (Table 2). We observe that these challenges cluster into four general themes, each with a different set of needs and paths for development.
The first theme is representing and exchanging designs, plans, data, metadata, and knowledge. Managing information for gigabase genome design requires addressing many challenges regarding scale, representation, and standards. Relatively mature technologies exist to address most individual needs, as well as to assist with the integration of workflows. The practical implementation of effective workflows will require significant investment in building infrastructure and tools that adopt these technologies, including domain-specific extensions and refinements.
The second theme is sharing and integrating data quality and experimental measurements. Sharing and integrating information arising from measurements of biological material poses significant challenges. It remains unclear what information would be advantageous to share, given the difficulty of obtaining and interpreting measurements of biological systems and the expense and unfavorable scaling of data curation. However, effective integration depends on associating reproducible measurement data with well-curated knowledge and metadata in compatible representations. A number of potential solutions exist for each of these, but significant investment will be needed to investigate how the state of the art can be extended to address these needs.
The third theme is integration of modeling and design at the gigabase scale. Considerable challenges surround efforts to develop a deeper understanding of the relationship between genotype and phenotype, regarding both the interpretation of experimental data and the application of that data to create and validate models, which may be applied in computer-assisted design. Long-term investment in fundamental research is needed, and the suite of biological systems of varying complexity, from cell-free systems to minimal and synthetic cells to natural living systems, may offer suitable experimental platforms for learning the relationship between genotype and phenotype.
Finally, the fourth theme is technical support for Ethical, Legal, and Societal Implications (ELSI) and Intellectual Property (IP) at scale. At the gigabase scale, computer-assisted workflows will be necessary to manage contracts, intellectual property, materials transfers, and other legal and societal interactions. Such workflows will need to be developed by interdisciplinary teams involving experts in law, ELSI issues, software engineering, and knowledge representation. Moreover, it will be critical to address these issues early, to minimize the potential for problematic entanglements associated with the reuse of resources.
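As a rough illustration of what such a computer-assisted workflow might track, the sketch below propagates usage restrictions through a record of which resources were derived from which. It is only a minimal sketch under stated assumptions: the `Resource` class, the `inherited_restrictions` function, the restriction labels, and the resource names are all hypothetical and do not correspond to any existing tool, standard, or legal framework.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A design, sample, or dataset with the legal terms attached to it (hypothetical model)."""
    name: str
    restrictions: frozenset = frozenset()  # illustrative labels, e.g. "non-commercial"
    derived_from: tuple = ()               # names of the resources this one reuses


def inherited_restrictions(name, registry, _seen=None):
    """Collect every restriction that applies to `name`, including those
    inherited from all resources it was derived from, directly or indirectly."""
    seen = set() if _seen is None else _seen
    if name in seen:  # guard against cycles or shared ancestors in the records
        return set()
    seen.add(name)
    resource = registry[name]
    found = set(resource.restrictions)
    for parent in resource.derived_from:
        found |= inherited_restrictions(parent, registry, seen)
    return found


# Hypothetical registry: a chromosome built from a segment and a chassis strain.
registry = {
    r.name: r
    for r in [
        Resource("promoter_lib", frozenset({"non-commercial"})),
        Resource("chassis_strain", frozenset({"mta-required"})),
        Resource("segment_17", derived_from=("promoter_lib",)),
        Resource("chromosome_3", derived_from=("segment_17", "chassis_strain")),
    ]
}

print(sorted(inherited_restrictions("chromosome_3", registry)))
# → ['mta-required', 'non-commercial']
```

Even in this toy form, the final construct inherits terms from resources two steps removed, which is the kind of entanglement that becomes hard to track by hand as the number of reused resources grows.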
In short, engineering gigabase-scale genomes presents significant challenges that will require coordinated investment to overcome. Because many other areas of bioscience face similar challenges, solutions will likely also benefit the broader bioscience community. Importantly, the challenges of scale, integration, and lack of knowledge faced in genome engineering are not fundamentally different in nature from those that have been overcome previously in other engineering ventures, such as aerospace engineering and microchip design, which required organizing people and sharing information across many institutions over time. Thus, we expect to be able to adapt solutions from these other fields for genome engineering.
Investment in capabilities for genome engineering workflows is critical to move from a world in which genome engineering is a heroic effort to one in which it is routine, safe, and reliable. Such investment will support and enable a vast number of projects, including many not yet conceived, as was the case for reading the human genome. As workflow technologies improve, we anticipate that the trend toward ever-larger teams will eventually reverse, enabling high-fidelity whole-genome engineering at a modest cost and supporting a wide range of medical and industrial applications.
Acknowledgements
This work was supported, in part, by NIH awards P41-EB023912 and R35-GM119771 and by NSF awards 1548123 and 1522074. We thank the reviewers for legal insights for Section “Contracts, intellectual property, and laws”, and thank Nicola Hawes for help illustrating Fig. 2. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of these funding agencies or the U.S. Government. This document does not contain technology or technical data controlled under either U.S. International Traffic in Arms Regulation or U.S. Export Administration Regulations. Certain commercial equipment, instruments, or materials are identified to adequately specify experimental procedures. Such identification neither implies recommendation nor endorsement by the National Institute of Standards and Technology nor that the equipment, instruments, or materials identified are necessarily the best for the purpose.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks David Grewal, Tom Ellis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
These authors contributed equally: Bryan A. Bartley, Jacob Beal, Jonathan R. Karr, Elizabeth A. Strychalski.