Figure - PMC


2020 Dec 8;11:6293. doi: 10.1038/s41467-020-19612-0

© The Author(s) 2020

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1

a The Addgene plasmid repository (bottom) provides a model through which to study the deployment scenario (top) for genetic engineering attribution. In the model scenario, research laboratories engineer organisms and share their genetic designs with the research community by depositing the DNA sequence and phenotypic metadata with Addgene. In the corresponding deployment scenario, a genetically engineered organism of unknown origin is obtained, for example, from an environmental sample, lab accident, misuse incident, or case of disputed authorship. By characterizing this sample in the laboratory with sequencing and phenotype experiments, the investigator identifies the engineered sequence and phenotype information. In either case, the sequence and phenotype information are input to an attribution model, which predicts the probability that the organism originated from individuals connected to a set of known labs, enabling further conventional investigation; * indicates a hypothetical “best match” predicted by the imagined model. Above, the same information may be input to a wider toolkit of methods that provide actionable leads and characterization of the sample to support the investigator.

b DNA motifs are inferred through the Byte Pair Encoding (BPE) algorithm²⁵, which successively merges the most frequently occurring pairs of tokens to compress input sequences into a vocabulary larger than the traditional four DNA bases. As merging progresses, sequences become shorter and new motif tokens become longer.

c BPE on the training set of Addgene plasmid sequences. The x axis shows the tokens rank-ordered by decreasing frequency in the sequence set. The y axis shows token length, in base pairs. Example tokens (bold, numbered) are linked to biologically meaningful sequence motifs.

d The deteRNNt method takes 100 random subsequences from the plasmid, encoded with BPE, and embeds them into a continuous space via a learned word embedding⁵⁹ matrix layer. These (potentially variable-length) sequences are processed by an LSTM network. We average the predictions from each subsequence to obtain a softmax probability that the plasmid originated in a given lab.

e Top-k prediction accuracy on the test set. Compared: deteRNNt, deteRNNt trained without phenotype, BLASTn, the CNN deep-learning state-of-the-art method²¹, a baseline that guesses the most abundant labs from the training set, and guessing uniformly at random (too low to be visible). * indicates P < 10⁻¹⁰, by Welch’s two-tailed t test on n = 30 × 50% bootstrap replicates compared to BLAST.

f Top-k prediction accuracy on the test set with and without phenotype information. * indicates P < 10⁻¹⁰, by Welch’s two-tailed t test on n = 30 × 50% bootstrap replicates compared to no phenotype.
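The merging procedure described for panel b can be illustrated with a minimal Python sketch. This is a toy version of Byte Pair Encoding on DNA strings, not the authors' implementation: the function name `bpe_merges`, the tie-breaking rule, and the example sequences are all assumptions made for illustration.

```python
from collections import Counter


def bpe_merges(sequences, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Hypothetical helper for illustration only. Returns the list of merged
    pairs (in order) and the re-tokenized sequences.
    """
    corpus = [list(seq) for seq in sequences]  # start from single bases
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across all sequences.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Ties are broken by first occurrence (an arbitrary choice here).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_token = best[0] + best[1]
        # Rewrite each sequence, replacing the chosen pair left to right.
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged_token)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus


merges, corpus = bpe_merges(["ATATAT", "ATGC"], num_merges=2)
```

In this toy input the first merge creates the token `AT` and the second creates `ATAT`, so the token sequences get shorter while the vocabulary entries get longer, which is the trend the caption describes.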
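The prediction averaging in panel d and the top-k accuracy metric in panels e and f can be sketched as follows. This is a hedged illustration, not the deteRNNt code: it assumes that the per-subsequence softmax outputs are what is averaged (the caption does not say whether averaging happens before or after the softmax), and the function names and array shapes are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def plasmid_probs(subseq_logits):
    """Average per-subsequence lab probabilities for one plasmid.

    subseq_logits: array of shape (n_subsequences, n_labs), e.g. one row
    per random subsequence. Assumption: softmax first, then average.
    """
    return softmax(subseq_logits).mean(axis=0)


def top_k_accuracy(probs, true_labs, k):
    """Fraction of plasmids whose true lab is among the k highest-scoring labs.

    probs: array of shape (n_plasmids, n_labs); true_labs: integer lab indices.
    """
    top_k = np.argsort(-probs, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_labs)[:, None]).any(axis=1)
    return float(hits.mean())


# Made-up probabilities for three plasmids and three labs (illustration only).
example_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.35, 0.40, 0.25],
])
example_labs = [0, 1, 0]
top1 = top_k_accuracy(example_probs, example_labs, k=1)
top2 = top_k_accuracy(example_probs, example_labs, k=2)
```

Top-k accuracy can only rise as k increases, because the true lab needs to appear anywhere in a longer list of candidates.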